Learning Deterministic Regular Expressions for the 
Inference of Schemas from XML Data 



GEERT JAN BEX, WOUTER GELADE, FRANK NEVEN 
Hasselt University and Transnational University of Limburg 
and 

STIJN VANSUMMEREN
Université Libre de Bruxelles



Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML 
documents essentially reduces to learning deterministic regular expressions from sets of positive 
example words. Unfortunately, there is no algorithm capable of learning the complete class of 
deterministic regular expressions from positive examples only, as we will show. The regular ex- 
pressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol 
occurs only a small number of times. As such, in practice it suffices to learn the subclass of 
deterministic regular expressions in which each alphabet symbol occurs at most k times, for some 
small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short).
Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for
increasing values of k, and selects the deterministic one that best describes the sample based on a
Minimum Description Length argument. The effectiveness of the method is empirically validated 
both on real world and synthetic data. Furthermore, the method is shown to be conservative over 
the simpler classes of expressions considered in previous work. 

Categories and Subject Descriptors: F.4.3 [Mathematical Logic and Formal Languages]:
Formal Languages; I.2.6 [Artificial Intelligence]: Learning; I.7.2 [Document and Text
Processing]: Document Preparation

General Terms: Algorithms, Languages, Theory 
1. INTRODUCTION

Recent studies indicate that schemas accompanying collections of XML documents
are sparse and erroneous in practice. Indeed, Barbosa et al. [2005] and Mignet et al.
[2003] have shown that approximately half of the XML documents available on the
web do not refer to a schema. In addition, Bex et al. [2004] and Martens et al.
[2006] have noted that about two-thirds of XML Schema Definitions (XSDs) gathered
from schema repositories and from the web at large are not valid with respect
to the W3C XML Schema specification [Thompson et al. 2001], rendering them
essentially useless for immediate application. A similar observation was made by
Sahuguet [2000] concerning Document Type Definitions (DTDs). Nevertheless, the
presence of a schema strongly facilitates optimization of XML processing (cf., e.g.,
[Benedikt et al. 2005; Che et al. 2006; Du et al. 2004; Freire et al. 2002; Koch et al.
2004; Manolescu et al. 2001; Neven and Schwentick 2006]) and various software
development tools such as Castor [cas ] and SUN's JAXB [jax ] rely on schemas
as well to perform object-relational mappings for persistence. Additionally, the
existence of schemas is imperative when integrating (meta) data through schema
matching [Rahm and Bernstein 2001] and in the area of generic model management
[Bernstein 2003].

<!ELEMENT store (order*, stock)>
<!ELEMENT order (customer, item+)>
<!ELEMENT customer (first, last, email*)>
<!ELEMENT item (id, price + (qty, (supplier + item+)))>
<!ELEMENT stock (item*)>
<!ELEMENT supplier (first, last, email*)>

Fig. 1. An example DTD.

Based on the above described benefits of schemas and their unavailability in 
practice, it is essential to devise algorithms that can infer a DTD or XSD for a 
given collection of XML documents when none, or no syntactically correct one, is 
present. This is also acknowledged by Florescu [2005] who emphasizes that in the 
context of data integration 

"We need to extract good-quality schemas automatically from existing 
data and perform incremental maintenance of the generated schemas."

As illustrated in Figure 1, a DTD is essentially a mapping d from element names 
to regular expressions over element names. An XML document is valid with respect 
to the DTD if for every occurrence of an element name e in the document, the 
word formed by its children belongs to the language of the corresponding regular 
expression d(e). For instance, the DTD in Figure 1 requires each store element
to have zero or more order children, which must be followed by a stock element.
Likewise, each order must have a customer child, which must be followed by one 
or more item elements. 

To infer a DTD from a corpus of XML documents C it hence suffices to look, 
for each element name e that occurs in a document in C, at the set of element 
name words that occur below e in C, and to infer from this set the corresponding 
regular expression d(e). As such, the inference of DTDs reduces to the inference 
of regular expressions from sets of positive example words. To illustrate, from the 
words id price, id qty supplier, and id qty item item appearing under <item> 
elements in a sample XML corpus, we could derive the rule 

item → (id, price + (qty, (supplier + item+))).
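
The collection step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this article: the function name `child_words` and the sample document `doc` are hypothetical, and the standard-library `xml.etree.ElementTree` parser is assumed to be adequate for the corpus at hand.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def child_words(xml_text):
    """Map each element name to the set of child-name words observed below it."""
    samples = defaultdict(set)
    root = ET.fromstring(xml_text)
    for elem in root.iter():  # the root and all of its descendants
        # The word for this occurrence is the sequence of its children's names.
        samples[elem.tag].add(tuple(child.tag for child in elem))
    return samples


# Hypothetical toy corpus consisting of a single document.
doc = """
<stock>
  <item><id/><price/></item>
  <item><id/><qty/><supplier/></item>
  <item><id/><qty/><item><id/><price/></item><item><id/><price/></item></item>
</stock>
"""

for word in sorted(child_words(doc)["item"]):
    print(" ".join(word))
```

The three words printed for item are exactly the sample from which the rule above could be derived; the regular expression inference itself is the subject of the remainder of the article.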

Although XSDs are more expressive than DTDs, and although XSD inference is 
therefore more involved than DTD inference, derivation of regular expressions re- 
mains one of the main building blocks on which XSD inference algorithms are built. 
In fact, apart from also inferring atomic data types, systems like Trang [Clark ] and
XStruct [Hegewald et al. 2006] simply infer DTDs in XSD syntax. The more recent 
iXSD algorithm [Bex et al. 2007] does infer true XSD schemas by first deriving a 
regular expression for every context in which an element name appears, where the 
context is determined by the path from the root to that element, and subsequently 
reduces the number of contexts by merging similar ones. 

So, the effectiveness of DTD or XSD schema inference algorithms is strongly 
determined by the accuracy of the employed regular expression inference method. 
The present article presents a method to reliably learn regular expressions that 
are far more complex than the classes of expressions previously considered in the 
literature. 

1.1 Problem setting 

In particular, let $\Sigma$ be a fixed set of alphabet symbols (also called element names),
and let $\Sigma^*$ be the set of all words over $\Sigma$.

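Definition 1.1 (Regular Expressions). Regular expressions are derived by the
following grammar.

$$r, s ::= \emptyset \mid \varepsilon \mid a \mid r \,.\, s \mid r + s \mid r? \mid r^+$$

Here, parentheses may be added to avoid ambiguity; $\varepsilon$ denotes the empty word;
$a$ ranges over symbols in $\Sigma$; $r \,.\, s$ denotes concatenation; $r + s$ denotes disjunction;
$r^+$ denotes one-or-more repetitions; and $r?$ denotes the optional regular expression.
That is, the language $\mathcal{L}(r)$ accepted by regular expression $r$ is given by:

$$\begin{aligned}
\mathcal{L}(\emptyset) &= \emptyset & \mathcal{L}(\varepsilon) &= \{\varepsilon\} \\
\mathcal{L}(a) &= \{a\} & \mathcal{L}(r \,.\, s) &= \{vw \mid v \in \mathcal{L}(r),\ w \in \mathcal{L}(s)\} \\
\mathcal{L}(r + s) &= \mathcal{L}(r) \cup \mathcal{L}(s) & \mathcal{L}(r^+) &= \{v_1 \cdots v_n \mid n \geq 1 \text{ and } v_1, \ldots, v_n \in \mathcal{L}(r)\} \\
\mathcal{L}(r?) &= \mathcal{L}(r) \cup \{\varepsilon\}. &&
\end{aligned}$$
□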

Note that the Kleene star operator (denoting zero or more repetitions as in $r^*$) is
not allowed by the above syntax. This is not a restriction, since $r^*$ can always be
represented as $(r^+)?$ or $(r?)^+$. Conversely, the latter can always be rewritten into
the former for presentation to the user.

The class of all regular expressions is actually too large for our purposes, as both 
DTDs and XSDs require the regular expressions occurring in them to be deterministic
(also sometimes called one-unambiguous [Brüggemann-Klein and Wood
1998]). Intuitively, a regular expression is deterministic if, without looking ahead
in the input word, each symbol of that word can be matched uniquely against a
position in the expression when processing the input in one pass from left to right.
For instance, (a + b)*a is not deterministic as already the first symbol in the word 
aaa could be matched by either the first or the second a in the expression. Without 
lookahead, it is impossible to know which one to choose. The equivalent expression 
b*a(b*a)*, on the other hand, is deterministic. 

Definition 1.2. Formally, let $\bar{r}$ stand for the regular expression obtained from $r$
by replacing the $i$th occurrence of alphabet symbol $a$ in $r$ by $a^{(i)}$, for every $i$ and
$a$. For example, for $r = b^+ a (b a^+)?$ we have $\bar{r} = {b^{(1)}}^+ a^{(1)} (b^{(2)} {a^{(2)}}^+)?$. A regular
expression $r$ is deterministic if there are no words $w a^{(i)} v$ and $w a^{(j)} v'$ in $\mathcal{L}(\bar{r})$ such
that $i \neq j$. □

Equivalently, an expression is deterministic if the Glushkov construction
[Brüggemann-Klein 1993] translates it into a deterministic finite automaton rather than a
non-deterministic one [Brüggemann-Klein and Wood 1998]. Not every non-deterministic
regular expression is equivalent to a deterministic one [Brüggemann-Klein and
Wood 1998]. Thus, semantically, the class of deterministic regular expressions
forms a strict subclass of the class of all regular expressions. 
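
The Glushkov characterization suggests a simple determinism test, sketched below in Python. The sketch is illustrative only: the tuple-based syntax tree and the helper names (`lit`, `cat`, `alt`, `opt`, `plus`, `star`, `glushkov`, `is_deterministic`) are assumptions made for this example, and $\varepsilon$ and $\emptyset$ are not handled. It computes the first and follow sets of the marked expression and reports non-determinism whenever two distinct positions carrying the same symbol compete.

```python
def lit(a): return ("sym", a)
def cat(r, s): return ("cat", r, s)
def alt(r, s): return ("alt", r, s)
def opt(r): return ("opt", r)
def plus(r): return ("plus", r)
def star(r): return opt(plus(r))  # r* written as (r+)?


def glushkov(r):
    """Return (first, follow, sym) for the marked version of expression r."""
    sym = {}     # position -> alphabet symbol
    follow = {}  # position -> positions that may come immediately after it

    def visit(node):
        """Return (nullable, first, last) of the subexpression `node`."""
        kind = node[0]
        if kind == "sym":
            p = len(sym)  # fresh position for this occurrence
            sym[p] = node[1]
            follow[p] = set()
            return False, {p}, {p}
        if kind == "opt":
            _, f, l = visit(node[1])
            return True, f, l
        if kind == "plus":
            n, f, l = visit(node[1])
            for p in l:  # a repetition may restart after any last position
                follow[p] |= f
            return n, f, l
        n1, f1, l1 = visit(node[1])
        n2, f2, l2 = visit(node[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for p in l1:  # kind == "cat"
            follow[p] |= f2
        return n1 and n2, (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)

    _, first, _ = visit(r)
    return first, follow, sym


def is_deterministic(r):
    """True iff no first/follow set holds two positions with the same symbol."""
    first, follow, sym = glushkov(r)
    for successors in [first, *follow.values()]:
        labels = [sym[p] for p in successors]
        if len(labels) != len(set(labels)):
            return False
    return True


a, b = lit("a"), lit("b")
print(is_deterministic(cat(star(alt(a, b)), a)))                      # (a + b)*a
print(is_deterministic(cat(cat(star(b), a), star(cat(star(b), a)))))  # b*a(b*a)*
```

On the two expressions discussed above, the check rejects (a + b)*a, whose first set contains both occurrences of a, and accepts b*a(b*a)*.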

For the purpose of inferring DTDs and XSDs from XML data, we are hence in 
search of an algorithm that, given enough sample words of a target deterministic 
regular expression r, returns a deterministic expression r' equivalent to r. In the 
framework of learning in the limit [Gold 1967], such an algorithm is said to learn 
the deterministic regular expressions from positive data. 

Definition 1.3. Define a sample to be a finite subset of $\Sigma^*$ and let $\mathcal{R}$ be a subclass
of the regular expressions. An algorithm $M$ mapping samples to expressions in $\mathcal{R}$
learns $\mathcal{R}$ in the limit from positive data if (1) $S \subseteq \mathcal{L}(M(S))$ for every sample $S$ and
(2) to every $r \in \mathcal{R}$ we can associate a so-called characteristic sample $S_r \subseteq \mathcal{L}(r)$ such
that, for each sample $S$ with $S_r \subseteq S \subseteq \mathcal{L}(r)$, $M(S)$ is equivalent to $r$. □

Intuitively, the first condition says that $M$ must be sound; the second that $M$
must be complete, given enough data. A class of regular expressions $\mathcal{R}$ is learnable
in the limit from positive data if an algorithm exists that learns $\mathcal{R}$. For the class of
all regular expressions, it was shown by Gold that no such algorithm exists [Gold 
1967]. We extend this result to the class of deterministic expressions: 

Theorem 1.4. The class of deterministic regular expressions is not learnable in 
the limit from positive data. 

Proof. It was shown by Gold [1967, Theorem 1.8], that any class of regular 
expressions that contains all non-empty finite languages as well as at least one 
infinite language is not learnable in the limit from positive data. Since deterministic 
regular expressions like a* define an infinite language, it suffices to show that every 
non-empty finite language is definable by a deterministic expression. Hereto, let 
S be a finite, non-empty set of words. Now consider the prefix tree T for S. For 
example, if S = {a, aab, abc, aac}, we have the following prefix tree: 

            ○
            | a
            ◎
        a /   \ b
         ○     ○
      b / \ c  | c
       ◎   ◎   ◎

Nodes for which the path from the root to that node forms a word in S are marked 
by double circles. In particular, all leaf nodes are marked. 

By viewing the internal nodes in T with two or more children as disjunctions; 
internal nodes in T with one child as conjunctions; and adding a question mark for 
every marked internal node in T, it is straightforward to transform T into a regular 
expression. For example, with $S$ and $T$ as above we get $r = a \,.\, (b \,.\, c + a \,.\, (b + c))?$.
Clearly, $\mathcal{L}(r) = S$. Moreover, since no node in $T$ has two edges with the same label,
$r$ must be deterministic. □
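
The construction in this proof is easy to mechanize. The following Python sketch is an illustration under stated assumptions rather than part of the formal development: the function names `build_prefix_tree` and `to_regex` are hypothetical, words are given as strings of one-character symbols, and the sample is assumed to contain at least one non-empty word. Children are visited in sorted order, so the output may list the branches of a disjunction in a different order than the expression given in the proof.

```python
def build_prefix_tree(sample):
    """Prefix tree as nested dicts; 'marked' flags nodes that spell a sample word."""
    root = {"marked": False, "children": {}}
    for word in sample:
        node = root
        for symbol in word:
            node = node["children"].setdefault(
                symbol, {"marked": False, "children": {}})
        node["marked"] = True
    return root


def to_regex(node):
    """Expression for the words readable below `node` ('' when `node` is a leaf)."""
    branches = []
    for symbol, child in sorted(node["children"].items()):
        rest = to_regex(child)
        branches.append(f"{symbol} . {rest}" if rest else symbol)
    if not branches:
        return ""
    body = " + ".join(branches)
    # Several children form a disjunction; a marked internal node becomes optional.
    if len(branches) > 1 or (node["marked"] and " " in body):
        body = f"({body})"
    return body + "?" if node["marked"] else body


sample = {"a", "aab", "abc", "aac"}
print(to_regex(build_prefix_tree(sample)))  # a . (a . (b + c) + b . c)?
```

Because every node has at most one outgoing edge per symbol, each disjunction in the result starts with pairwise distinct symbols, which is what makes the expression deterministic.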

Theorem 1.4 immediately excludes the possibility for an algorithm to infer the 
full class of DTDs or XSDs. In practice, however, regular expressions occurring 
in DTDs and XSDs are concise rather than arbitrarily complex. Indeed, a study 
of 819 DTDs and XSDs gathered from the Cover Pages [Cover 2003] (including 
many high-quality XML standards) as well as from the web at large, reveals that 
regular expressions occurring in practical schemas are such that every alphabet 
symbol occurs only a small number of times [Martens et al. 2006]. In practice, 
therefore, it suffices to learn the subclass of deterministic regular expressions in 
which each alphabet symbol occurs at most k times, for some small k. We refer to
such expressions as k-occurrence regular expressions. 

Definition 1.5. A regular expression is k-occurrence if every alphabet symbol
occurs at most k times in it. □

For example, the expressions customer . order$^+$ and (school + institute)$^+$ are
both 1-occurrence, while id . (qty + id) is 2-occurrence (as id occurs twice). Observe
that if $r$ is k-occurrence, then it is also l-occurrence for every $l \geq k$. To simplify
notation in what follows, we abbreviate 'k-occurrence regular expression' by k-ORE
and also refer to the 1-OREs as 'single occurrence regular expressions' or SOREs.
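
As a small illustration of Definition 1.5, the smallest k for which an expression is a k-ORE can be read off by counting symbol occurrences. The Python sketch below assumes, purely for the example, that expressions are written as strings in which element names are identifier-like tokens; the function name `occurrence_bound` is hypothetical.

```python
import re
from collections import Counter


def occurrence_bound(expr):
    """Smallest k such that `expr` is a k-ORE: the largest per-symbol count."""
    counts = Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expr))
    return max(counts.values(), default=0)


print(occurrence_bound("customer . order+"))      # 1
print(occurrence_bound("(school + institute)+"))  # 1
print(occurrence_bound("id . (qty + id)"))        # 2
```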

1.2 Outline and Contributions 

Actually, the above mentioned examination shows that in the majority of the cases 
k = 1. Motivated by that observation, we have studied and suggested practical 
learning algorithms for the class of deterministic SOREs in a companion article [Bex 
et al. 2006]. These algorithms, however, can only output SOREs even when the 
target regular expression is not. In that case they always return an approximation 
of the target expressions. It is therefore desirable to also have learning algorithms 
for the class of deterministic k-OREs with $k \geq 2$. Furthermore, since the exact
k-value for the target expression, although small, is unknown in a schema inference
setting, we also require an algorithm capable of determining the best value of k 
automatically. 

We begin our study of this problem in Section 3 by showing that, for each fixed k, 
the class of deterministic k-OREs is learnable in the limit from positive examples
only. We also argue, however, that this theoretical algorithm is unlikely to work 
well in practice as it does not provide a method to automatically determine the 
best value of k and needs samples whose size can be exponential in the size of the 
alphabet to successfully learn some target expressions. 

In view of these observations, we provide in Section 4 the practical algorithm 
iDRegEx. Given a sample of words S, iDRegEx derives corresponding deterministic
k-OREs for increasing values of k and selects from these candidate expressions
the expression that describes S best. To determine the "best" expression we propose
two measures: (1) a Language Size measure and (2) a Minimum Description
Length measure based on the work of Adriaans and Vitányi [2006]. The main technical
contribution lies in the subroutine used to derive the actual k-OREs for S.
Indeed, while for the special case where k = 1 one can derive a k-ORE by first
learning an automaton A for S using the inference algorithm of Garcia and Vidal 
[1990], and by subsequently translating A into a 1-ORE (as shown in [Bex et al. 
2006]), this approach does not work when $k \geq 2$. In particular, the algorithm of
Garcia and Vidal only works when learning languages that are "n-testable" for 
some fixed natural number n [Garcia and Vidal 1990]. Although every language 
definable by a 1-ORE is 2-testable [Bex et al. 2006], there are languages definable 
by a 2-ORE, for instance a*ba*, that are not n-testable for any n. We therefore 
use a probabilistic method based on Hidden Markov Models to learn an automaton 
for S, which is subsequently translated into a k-ORE.

The effectiveness of iDRegEx is empirically validated in Section 5 both on real-world
and synthetic data. We compare the results of iDRegEx with those of
the algorithm presented in previous work [Bex et al. 2008], to which we refer as
iDRegEx(RWR°).

2. RELATED WORK 

Semi-structured data. In the context of semi-structured data, the inference of 
schemas as defined in [Buneman et al. 1997; Quass et al. 1996] has been extensively
studied [Goldman and Widom 1997; Nestorov et al. 1998]. No methods were
provided to translate the inferred types to regular expressions, however. 

DTD and XSD inference. In the context of DTD inference, Bex et al. [2006] 
gave in earlier work two inference algorithms: one for learning 1-OREs and one for 
learning the subclass of 1-OREs known as chain regular expressions. The latter 
class can also be learned using Trang [Clark ], state of the art software written 
by James Clark that is primarily intended as a translator between the schema 
languages DTD, Relax NG [Clark and Murata 2001], and XSD, but also infers a 
schema for a set of XML documents. In contrast, our goal in this article is to infer 
the more general class of deterministic expressions. XTRACT [Garofalakis et al.
2003] is another regular expression learning system with similar goals. We note 
that XTRACT also uses the Minimum Description Length principle to choose the 
best expression from a set of candidates. 

Other relevant DTD inference research includes [Sankey and Wong 2001] and [Chidlovskii
2001], which learn finite automata but do not consider the translation to deterministic
regular expressions. Also, in [Young-Lai and Tompa 2000] a method is proposed to 
infer DTDs through stochastic grammars where right-hand sides of rules are repre- 
sented by probabilistic automata. No method is provided to transform these into 
regular expressions. Although Ahonen [1996] proposes such a translation, the ef- 
fectiveness of her algorithm is only illustrated by a single case study of a dictionary 
example; no experimental study is provided. 

Also relevant are the XSD inference systems [Bex et al. 2007; Clark ; Hegewald 
et al. 2006] that, as already mentioned, rely on the same methods for learning 
regular expressions as DTD inference. 

Regular expression inference. Most of the learning of regular languages from 
positive examples in the computational learning community is directed towards in- 
ference of automata as opposed to inference of regular expressions [Angluin and 
Smith 1983; Pitt 1989; Sakakibara 1997]. However, these approaches learn strict
subclasses of the regular languages which are incomparable to the subclasses consid- 
ered here. Some approaches to inference of regular expressions for restricted cases 
have been considered. For instance, [Brazma 1993] showed that regular expressions 
without union can be approximately learned in polynomial time from a set of ex- 
amples satisfying some criteria. [Fernau 2005] provided a learning algorithm for 
regular expressions that are finite unions of pairwise left-aligned union-free regular 
expressions. The development is purely theoretical; no experimental validation has
been performed.

HMM learning. Although there has been work on Hidden Markov Model struc- 
ture induction [Rabiner 1989; Freitag and McCallum 2000], the requirement in our 
setting that the resulting automaton is deterministic is, to the best of our knowl- 
edge, unique. 

3. BASIC RESULTS 

In this section we establish that, in contrast to the class of all deterministic expressions,
the subclass of deterministic k-OREs can theoretically be learned in the limit
from positive data, for each fixed k. We also argue, however, that this theoretical
algorithm is unlikely to work well in practice. 

Let $\Sigma(r)$ denote the set of alphabet symbols that occur in a regular expression
$r$, and let $\Sigma(S)$ be similarly defined for a sample $S$. Define the length of a regular
expression $r$ as the length of its string representation, including operators and
parentheses. For example, the length of $(a \,.\, b)^+? + c$ is 9.

Theorem 3.1. For every k there exists an algorithm $M$ that learns the class of
deterministic k-OREs from positive data. Furthermore, on input $S$, $M$ runs in
time polynomial in the size of $S$, yet exponential in $k$ and $|\Sigma(S)|$.

Proof. The algorithm $M$ is based on the following observations. First observe
that every deterministic k-ORE $r$ over a finite alphabet $A \subseteq \Sigma$ can be simplified
into an equivalent deterministic k-ORE $r'$ of length at most $10k|A|$ by rewriting $r$
according to the following system of rewrite rules until no more rule is applicable:



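$$\begin{array}{rcl@{\qquad}rcl}
((s)) & \to & (s) & s?^{+} & \to & s^{+}? \\
s?? & \to & s? & s^{++} & \to & s^{+} \\
s + \varepsilon & \to & s? & \varepsilon + s & \to & s? \\
s \,.\, \varepsilon & \to & s & \varepsilon \,.\, s & \to & s \\
\varepsilon? & \to & \varepsilon & \varepsilon^{+} & \to & \varepsilon \\
s + \emptyset & \to & s & \emptyset + s & \to & s \\
s \,.\, \emptyset & \to & \emptyset & \emptyset \,.\, s & \to & \emptyset \\
\emptyset? & \to & \varepsilon & \emptyset^{+} & \to & \emptyset
\end{array}$$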






(The first rewrite rule removes redundant parentheses in $r$.) Indeed, since each
rewrite rule clearly preserves determinism and language equivalence, $r'$ must be a
deterministic expression equivalent to $r$. Moreover, since none of the rewrite rules
duplicates a subexpression and since $r$ is a k-ORE, so is $r'$. Now note that, since
no rewrite rule applies to it, $r'$ is either $\emptyset$, $\varepsilon$, or generated by the following grammar

$$\begin{array}{rcl}
t & ::= & a \mid a? \mid a^+ \mid a^+? \mid (a) \mid (a)? \mid (a)^+ \mid (a)^+? \\
  & \mid & t_1 \,.\, t_2 \mid (t_1 \,.\, t_2) \mid (t_1 \,.\, t_2)? \mid (t_1 \,.\, t_2)^+ \mid (t_1 \,.\, t_2)^+? \\
  & \mid & t_1 + t_2 \mid (t_1 + t_2) \mid (t_1 + t_2)? \mid (t_1 + t_2)^+ \mid (t_1 + t_2)^+?
\end{array}$$

It is not difficult to verify by structural induction that any expression $t$ produced
by this grammar has length

$$|t| \leq -4 + 10 \sum_{a \in \Sigma(t)} \mathit{rep}(t, a),$$

where $\mathit{rep}(t, a)$ denotes the number of times alphabet symbol $a$ occurs in $t$. For
instance, $\mathit{rep}(b \,.\, (b + c), a) = 0$ and $\mathit{rep}(b \,.\, (b + c), b) = 2$. Since $\mathit{rep}(r', a) \leq k$ for
every $a \in \Sigma(r')$, it readily follows that $|r'| \leq 10k|A| - 4 < 10k|A|$.

Then observe that all possible regular expressions over $A$ of length at most $10k|A|$
can be enumerated in time exponential in $k|A|$. Since checking whether a regular
expression is deterministic is decidable in polynomial time [Brüggemann-Klein
and Wood 1998]; and since equivalence of deterministic expressions is decidable in
polynomial time [Brüggemann-Klein and Wood 1998], it follows by the above
observations that for each $k$ and each finite alphabet $A \subseteq \Sigma$ it is possible to compute
in time exponential in $k|A|$ a finite set $\mathcal{R}_A$ of pairwise non-equivalent deterministic
k-OREs over $A$ such that

— every $r \in \mathcal{R}_A$ is of size at most $10k|A|$; and

— for every deterministic k-ORE $r$ over $A$ there exists an equivalent expression
$r' \in \mathcal{R}_A$.

(Note that since $\mathcal{R}_A$ is computable in time exponential in $k|A|$, it has at most an
exponential number of elements in $k|A|$.) Now fix, for each finite $A \subseteq \Sigma$ an arbitrary
order $\prec$ on $\mathcal{R}_A$, subject to the provision that $r \prec s$ only if $\mathcal{L}(s) - \mathcal{L}(r) \neq \emptyset$. Such
an order always exists since $\mathcal{R}_A$ does not contain equivalent expressions.

Then let $M$ be the algorithm that, upon sample $S$, computes $\mathcal{R}_{\Sigma(S)}$ and outputs
the first (according to $\prec$) expression $r \in \mathcal{R}_{\Sigma(S)}$ for which $S \subseteq \mathcal{L}(r)$. Since $\mathcal{R}_{\Sigma(S)}$ can
be computed in time exponential in $k|\Sigma(S)|$; since there are at most an exponential
number of expressions in $\mathcal{R}_{\Sigma(S)}$; since each expression $r \in \mathcal{R}_{\Sigma(S)}$ has size at most
$10k|\Sigma(S)|$; and since checking membership in $\mathcal{L}(r)$ of a single word $w \in S$ can be
done in time polynomial in the size of $w$ and $r$, it follows that $M$ runs in time
polynomial in $S$ and exponential in $k|\Sigma(S)|$.

Furthermore, we claim that $M$ learns the class of deterministic k-OREs. Clearly,
$S \subseteq \mathcal{L}(M(S))$ by definition. Hence, it remains to show completeness, i.e., that we
can associate to each deterministic k-ORE $r$ a sample $S_r \subseteq \mathcal{L}(r)$ such that, for each
sample $S$ with $S_r \subseteq S \subseteq \mathcal{L}(r)$, $M(S)$ is equivalent to $r$. Note that, by definition of
$\mathcal{R}_{\Sigma(r)}$, there exists a deterministic k-ORE $r' \in \mathcal{R}_{\Sigma(r)}$ equivalent to $r$. Initialize $S_r$
to an arbitrary finite subset of $\mathcal{L}(r) = \mathcal{L}(r')$ such that each alphabet symbol of $r$
occurs at least once in $S_r$, i.e., $\Sigma(S_r) = \Sigma(r)$. Let $r_1 \prec \cdots \prec r_n$ be all predecessors of
$r'$ in $\mathcal{R}_{\Sigma(r)}$ according to $\prec$. By definition of $\prec$ there exists a word $w_i \in \mathcal{L}(r) - \mathcal{L}(r_i)$
for every $1 \leq i \leq n$. Add all of these words to $S_r$. Then clearly, for every sample $S$
with $S_r \subseteq S \subseteq \mathcal{L}(r)$ we have $\Sigma(S) = \Sigma(r)$ and $S \not\subseteq \mathcal{L}(r_i)$ for every $1 \leq i \leq n$. Since
$M(S)$ is the first expression in $\mathcal{R}_{\Sigma(r)}$ with $S \subseteq \mathcal{L}(r)$, we hence have $M(S) = r' \equiv r$,
as desired. □

While Theorem 3.1 shows that the class of deterministic k-OREs is better suited
for learning from positive data than the complete class of deterministic expressions, 
it does not provide a useful practical algorithm, for the following reasons. 

(1) First and foremost, $M$ runs in time exponential in the size of the alphabet $\Sigma(S)$,
which may be problematic for the inference of schemas with many element
names.

(2) Second, while Theorem 3.1 shows that the class of deterministic k-OREs is
learnable in the limit for each fixed k, the schema inference setting is such that
we do not know k a priori. If we overestimate k then $M(S)$ risks being an
under-approximation of the target expression $r$, especially when $S$ is incomplete.
To illustrate, consider the 1-ORE target expression $r = a^+ b^+$ and sample
$S = \{ab, abbb, aabb\}$. If we overestimate k to, say, 2 instead of 1, then $M$ is free
to output $aa?b^+$ as a sound answer. On the other hand, if we underestimate k
then $M(S)$ risks being an over-approximation of $r$. Consider, for instance, the
2-ORE target expression $r = aa?b^+$ and the same sample $S = \{ab, abbb, aabb\}$.
If we underestimate k to be 1 instead of 2, then $M$ can only output 1-OREs,
and needs to output at least $a^+ b^+$ in order to be sound. In summary: we need
a method to determine the most suitable value of k.

(3) Third, the notion of learning in the limit is a very liberal one: correct expressions
need only be derived when sufficient data is provided, i.e., when the input
sample is a superset of the characteristic sample for the target expression $r$.
The following theorem shows that there are reasonably simple expressions $r$
such that the characteristic sample $S_r$ of any sound and complete learning algorithm
is at least exponential in the size of $r$. As such, it is unlikely for any
sound and complete learning algorithm to behave well on real-world samples,
which are typically incomplete and hence unlikely to contain all words of the
characteristic sample.
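
The ambiguity described in point (2) can be checked directly. The sketch below uses Python's `re` module as a stand-in matcher, which is an assumption of this example only: for these two expressions the postfix operators + and ? happen to have the same meaning in Python's syntax as in ours, and the helper name `sound` is hypothetical.

```python
import re

sample = ["ab", "abbb", "aabb"]


def sound(pattern, words):
    """True iff every word of the sample belongs to the language of `pattern`."""
    return all(re.fullmatch(pattern, w) is not None for w in words)


print(sound("a+b+", sample))                  # True
print(sound("aa?b+", sample))                 # True
print(re.fullmatch("a+b+", "aaab") is None)   # False: aaab is accepted by a+b+
print(re.fullmatch("aa?b+", "aaab") is None)  # True: aaab is rejected by aa?b+
```

Both candidates are sound for the sample, yet they define different languages, so the sample alone cannot tell us which value of k the target expression has.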

Theorem 3.2. Let $A = \{a_1, \ldots, a_n\} \subseteq \Sigma$ consist of $n$ distinct element names.
Let $r_1 = (a_1 a_2 + a_3 + \cdots + a_n)^+$, and let $r_2 = (a_2 + \cdots + a_n)^+ a_1 (a_2 + \cdots + a_n)^+$.
For any algorithm that learns the class of deterministic $(2n + 3)$-OREs and any
sample $S$ that is characteristic for $r_1$ or $r_2$ we have $|S| \geq \sum_{i=1}^{n} (n - 2)^i$.

Proof. First consider $r_1 = (a_1 a_2 + a_3 + \cdots + a_n)^+$. Observe that there exist
an exponential number of deterministic $(2n + 3)$-OREs that differ from $r_1$ in only
a single word. Indeed, let $B = A - \{a_1, a_2\}$ and let $W$ consist of all non-empty
words $w$ over $B$ of length at most $n$. Define, for every word $w = b_1 \ldots b_m \in W$ the
deterministic $(2n + 3)$-ORE $r_w$ such that $\mathcal{L}(r_w) = \mathcal{L}(r_1) - \{w\}$ as follows. First,
define, for every $1 \leq i \leq m$ the deterministic 2-ORE $r_w^i$ that accepts all words in
$\mathcal{L}(r_1)$ that do not start with $b_i$:

$$r_w^i := (a_1 a_2 + (B - \{b_i\})) \,.\, (a_1 a_2 + a_3 + \cdots + a_n)^*$$

Clearly, $v \in \mathcal{L}(r_1) - \{w\}$ if, and only if, $v \in \mathcal{L}(r_1)$ and there is some $0 \leq i \leq m$
such that $v$ agrees with $w$ on the first $i$ letters, but differs in the $(i + 1)$-th letter.
Hence, it suffices to take

$$r_w := r_w^1 + b_1(\varepsilon + r_w^2 + b_2(\varepsilon + r_w^3 + b_3(\cdots + b_{m-1}(\varepsilon + r_w^m + b_m \,.\, r_1) \ldots)))$$

Now assume that algorithm $M$ learns the class of deterministic $(2n + 3)$-OREs and
suppose that $S_{r_1}$ is characteristic for $r_1$. In particular, $S_{r_1} \subseteq \mathcal{L}(r_1)$. By definition,
$M(S)$ is equivalent to $r_1$ for every sample $S$ with $S_{r_1} \subseteq S \subseteq \mathcal{L}(r_1)$. We claim that
in order for $M$ to have this property, $W$ must be a subset of $S_{r_1}$. Then, since $W$
contains all non-empty words over $B$ of length at most $n$, $|S_{r_1}| \geq \sum_{i=1}^{n} (n - 2)^i$, as desired. The
intuitive argument why $W$ must be a subset of $S_{r_1}$ is that if there exists $w$ in $W - S_{r_1}$,
then $M$ cannot distinguish between $r_1$ and $r_w$. Indeed, suppose for the purpose
of contradiction that there is some $w \in W$ with $w \notin S_{r_1}$. Then $S_{r_1}$ is a subset of
$\mathcal{L}(r_w)$. Indeed, $S_{r_1} = S_{r_1} - \{w\} \subseteq \mathcal{L}(r_1) - \{w\} = \mathcal{L}(r_w)$. Furthermore, since $M$
learns the class of deterministic $(2n + 3)$-OREs, there must be some characteristic
sample $S_{r_w}$ for $r_w$. Now, consider the sample $S_{r_1} \cup S_{r_w}$. It is included in both
$\mathcal{L}(r_1)$ and $\mathcal{L}(r_w)$ and is a superset of both $S_{r_1}$ and $S_{r_w}$. But then, by definition of
characteristic samples, $M(S_{r_1} \cup S_{r_w})$ must be equivalent to both $r_1$ and $r_w$. This
is absurd, however, since $\mathcal{L}(r_1) \neq \mathcal{L}(r_w)$ by construction.

A similar argument shows that the characteristic sample $S_{r_2}$ of $r_2 = (a_2 + \cdots +
a_n)^+ a_1 (a_2 + \cdots + a_n)^+$ also requires $\sum_{i=1}^{n} (n - 2)^i$ elements. In this case, we take
$B = A - \{a_1\}$ and we take $W$ to be the set of all non-empty words over $B$ of
length at most $n$. For each $w = b_1 \ldots b_m \in W$, we construct the deterministic
$(2n + 3)$-ORE $r_w$ that accepts all words in $\mathcal{L}(r_2)$ that do not end with
$a_1 w$, as follows. Let, for $1 \leq i \leq m$, $r_w^i$ be the 2-ORE that accepts all words in $B^+$
that do not start with $b_i$:

$$r_w^i := (B - \{b_i\}) \,.\, B^*$$

Then it suffices to take

$$r_w := B^+ a_1 (r_w^1 + b_1(\varepsilon + r_w^2 + b_2(\varepsilon + r_w^3 + b_3(\cdots + b_{m-1}(\varepsilon + r_w^m + b_m B^+) \ldots)))).$$

A similar argument as for $r_1$ then shows that the characteristic sample $S_{r_2}$ of $r_2$
needs to contain, for each $w \in W$, at least one word of the form $v a_1 w$ with $v \in B^+$.
Therefore, $|S_{r_2}| \geq \sum_{i=1}^{n} (n - 2)^i$, as desired. □
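
To give a feeling for how quickly the bound of Theorem 3.2 grows, the following worked computation evaluates it for two small alphabets. The closed form is the standard geometric sum and is valid for $n \geq 4$; the concrete values of $n$ are chosen purely for illustration.

```latex
\sum_{i=1}^{n} (n-2)^i \;=\; \frac{(n-2)^{n+1} - (n-2)}{n-3}
\qquad (n \geq 4)

n = 5:\quad  \sum_{i=1}^{5} 3^i  = 3 + 9 + 27 + 81 + 243 = \frac{3^{6} - 3}{2} = 363

n = 10:\quad \sum_{i=1}^{10} 8^i = \frac{8^{11} - 8}{7} = 1\,227\,133\,512
```

Hence, already for ten element names, a characteristic sample for $r_1$ or $r_2$ must contain more than a billion words, although each of these expressions has length linear in $n$.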

4. THE LEARNING ALGORITHM 

In view of the observations made in Section 3, we present in this section a practical 
learning algorithm that (1) works well on incomplete data and (2) automatically 
determines the best value of k (see Section 5 for an experimental evaluation). Specifically,
given a sample $S$, the algorithm derives deterministic k-OREs for increasing
values of k and selects from these candidate expressions the k-ORE that describes
$S$ best. To determine the "best" expression we propose two measures: (1) a Language
Size measure and (2) a Minimum Description Length measure based on the
work of Adriaans and Vitányi [2006].

Our algorithm does not derive deterministic k-OREs for $S$ directly, but uses, for
each fixed k, a probabilistic method to first learn an automaton for $S$, which is subsequently
translated into a k-ORE. The following section (Section 4.1) explains how
the probabilistic method that learns an automaton from $S$ works. Section 4.2 explains
how the learned automaton is translated into a k-ORE. Finally, Section 4.3
introduces the whole algorithm, together with the two measures to determine the
best candidate expression. 

4.1 Probabilistically Learning a Deterministic Automaton 

In particular, the algorithm first learns a deterministic k-occurrence automaton
(deterministic k-OA) for $S$. This is a specific kind of finite state automaton in
which each alphabet symbol can occur at most k times. Figure 2(a) gives an
example. Note that in contrast to the classical definition of an automaton, no
edges are labeled: all incoming edges in a state $s$ are assumed to be labeled by the
label of $s$. In other words, the 2-OA of Figure 2(a) accepts the same language as
$aa?b^+$.

Definition 4.1 (k-OA). An automaton is a node-labeled graph $G = (V, E, \mathit{lab})$
where

— $V$ is a finite set of nodes (also called states) with a distinguished source $\mathit{src} \in V$
and sink $\mathit{sink} \in V$;

— the edge relation $E$ is such that $\mathit{src}$ has only outgoing edges; $\mathit{sink}$ has only
incoming edges; and every state $v \in V - \{\mathit{src}, \mathit{sink}\}$ is reachable by a walk from
$\mathit{src}$ to $\mathit{sink}$;

— $\mathit{lab} : V - \{\mathit{src}, \mathit{sink}\} \to \Sigma$ is the labeling function.

In this context, an accepting run for a word $a_1 \ldots a_n$ is a walk $\mathit{src}\, s_1 \ldots s_n\, \mathit{sink}$
from $\mathit{src}$ to $\mathit{sink}$ in $G$ such that $a_i = \mathit{lab}(s_i)$ for $1 \leq i \leq n$. As usual, we denote
by $\mathcal{L}(G)$ the set of all words for which an accepting run exists. An automaton is
k-occurrence (a k-OA) if there are at most k states labeled by the same alphabet
symbol. If $G$ uses only labels in $A \subseteq \Sigma$ then $G$ is an automaton over $A$. □

In what follows, we write Succ(s) for the set $\{t \mid (s, t) \in E\}$ of all direct successors
of state s in G, and Pred(s) for the set $\{t \mid (t, s) \in E\}$ of all direct predecessors
of s in G. Furthermore, we write Succ(s, a) and Pred(s, a) for the set of states in
Succ(s) and Pred(s), respectively, that are labeled by a. As usual, an automaton G
is deterministic if Succ(s, a) contains at most one state, for every $s \in V$ and $a \in \Sigma$.
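The definitions above can be made concrete with a small sketch. The class below is only an illustration, not the authors' implementation: the class name `KOA`, the state names `a1`, `a2`, `b1`, and the edge set chosen for $aa?b^+$ are assumptions made for this example. It stores a node-labeled graph, checks for an accepting run by following Succ(s, a) symbol by symbol, and tests determinism exactly as defined (at most one state in Succ(s, a)).

```python
from collections import defaultdict

SRC, SINK = "src", "sink"

class KOA:
    """Node-labeled automaton: states carry labels, edges are unlabeled."""

    def __init__(self, labels, edges):
        self.lab = dict(labels)        # inner state -> alphabet symbol
        self.succ = defaultdict(set)   # state -> set of direct successors
        for s, t in edges:
            self.succ[s].add(t)

    def succ_by_label(self, s, a):
        """Succ(s, a): the direct successors of s that are labeled a."""
        return {t for t in self.succ[s] if self.lab.get(t) == a}

    def accepts(self, word):
        """True iff some walk src s_1 ... s_n sink spells out `word`."""
        current = {SRC}
        for a in word:
            current = {t for s in current for t in self.succ_by_label(s, a)}
        return any(SINK in self.succ[s] for s in current)

    def is_deterministic(self):
        """Succ(s, a) holds at most one state for every state s and symbol a."""
        states = set(self.lab) | {SRC}
        symbols = set(self.lab.values())
        return all(len(self.succ_by_label(s, a)) <= 1
                   for s in states for a in symbols)

    def occurrence_bound(self):
        """Smallest k for which this automaton is a k-OA."""
        counts = defaultdict(int)
        for a in self.lab.values():
            counts[a] += 1
        return max(counts.values(), default=0)

# An assumed 2-OA for aa?b+ (state names are illustrative only).
g = KOA(labels={"a1": "a", "a2": "a", "b1": "b"},
        edges=[(SRC, "a1"), ("a1", "a2"), ("a1", "b1"),
               ("a2", "b1"), ("b1", "b1"), ("b1", SINK)])
```

Because edges are unlabeled, reading a symbol amounts to filtering successors by their label, which is why `accepts` never needs a transition table keyed by symbols.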

For convenience, we will also refer to the 1-OAs as "single occurrence automata"
or SOAs for short.

We learn a deterministic k-OA for a sample S as follows. First, recall from
Section 3 that $\Sigma(S)$ is the set of alphabet symbols occurring in words in S. We view
S as the result of a stochastic process that generates words from $\Sigma^*$ by performing
random walks on the complete k-OA $C_k$ over $\Sigma(S)$.

Definition 4.2. Define the complete k-OA $C_k$ over $\Sigma(S)$ to be the k-OA
$G = (V, E, \mathit{lab})$ over $\Sigma(S)$ in which each $a \in \Sigma(S)$ labels exactly k states such that

— there is an edge from src to sink;

— src is connected to exactly one state labeled by a, for every $a \in \Sigma(S)$; and

— every state $s \in V - \{\mathit{src}, \mathit{sink}\}$ has an outgoing edge to every other state except
src. □
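As a rough illustration of Definition 4.2, the following sketch builds the complete k-OA as a pair of plain Python collections. The function name `complete_koa`, the encoding of states as pairs `(symbol, copy)`, and the choice of copy 1 as the state reached from src are assumptions of this example. The sketch also assumes that inner states carry self-loops: without them the complete 1-OA over {a} could not accept "aa", so the language would not be all words over the alphabet.

```python
from itertools import product

SRC, SINK = "src", "sink"

def complete_koa(alphabet, k):
    """Return (labels, edges) of the complete k-OA over `alphabet`."""
    # State (a, i) is the i-th state labeled a, for i = 1..k.
    labels = {(a, i): a for a in sorted(alphabet) for i in range(1, k + 1)}
    # The empty word is accepted through the edge src -> sink.
    edges = {(SRC, SINK)}
    # src is connected to exactly one state per symbol (here: copy 1).
    edges |= {(SRC, (a, 1)) for a in alphabet}
    # Every inner state points to every inner state (assumed: itself included) ...
    edges |= set(product(labels, repeat=2))
    # ... and to sink, but never back to src.
    edges |= {(s, SINK) for s in labels}
    return labels, edges

labels, edges = complete_koa({"a", "b"}, 2)
```

For the alphabet {a, b} and k = 2 this yields four inner states and 1 + 2 + 16 + 4 = 23 edges.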

To illustrate, the complete 2-OA over $\{a, b\}$ is shown in Figure 2(b). Clearly,
$\mathcal{L}(C_k) = \Sigma(S)^*$.

Fig. 2. Two 2-OAs. (a) An example 2-OA; it accepts the same language as $aa?b^+$.
(b) The complete 2-OA over $\{a, b\}$.

The stochastic process that generates words from $\Sigma^*$ by performing random walks
on $C_k$ operates as follows. First, the process picks, among all states in Succ(src),
a state $s_1$ with probability $\alpha(\mathit{src}, s_1)$ and emits $\mathit{lab}(s_1)$. Then it picks, among
all states in Succ($s_1$), a state $s_2$ with probability $\alpha(s_1, s_2)$ and emits $\mathit{lab}(s_2)$. The
process continues moving to new states and emitting their labels until the final state
is reached (which does not emit a symbol). Of course, $\alpha$ must be a true probability
distribution, i.e.,

\[ \alpha(s, t) \ge 0 \quad\text{and}\quad \sum_{t \in \mathrm{Succ}(s)} \alpha(s, t) = 1 \tag{1} \]

for all states $s \ne \mathit{sink}$ and all states $t$. The probability of generating a particular
accepting run $\mathbf{s} = \mathit{src}\; s_1 s_2 \cdots s_n\; \mathit{sink}$ given the process $\mathcal{P} = (C_k, \alpha)$ in this
setting is

\[ P[\mathbf{s} \mid \mathcal{P}] = \alpha(\mathit{src}, s_1) \cdot \alpha(s_1, s_2) \cdot \alpha(s_2, s_3) \cdots \alpha(s_n, \mathit{sink}), \]

and the probability of generating the word $w = a_1 \ldots a_n$ is

\[ P[w \mid \mathcal{P}] = \sum_{\text{all accepting runs } \mathbf{s} \text{ of } w \text{ in } C_k} P[\mathbf{s} \mid \mathcal{P}]. \]

Assuming independence, the probability of obtaining all words in the sample S is
then

\[ P[S \mid \mathcal{P}] = \prod_{w \in S} P[w \mid \mathcal{P}]. \]

Clearly, the process that best explains the observation of S is the one in which the
probabilities $\alpha$ are such that they maximize $P[S \mid \mathcal{P}]$.
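The sum over all accepting runs need not be enumerated run by run. A minimal sketch, assuming the process is given as a label map `lab` and a dictionary `alpha` of edge probabilities (both hypothetical encodings, as are the function names), accumulates the probability mass state by state in the style of the standard forward computation for Hidden Markov Models:

```python
import math

SRC, SINK = "src", "sink"

def word_probability(word, lab, alpha):
    """P[w | P]: total probability of all accepting runs of `word`.

    lab   maps each inner state to its symbol;
    alpha maps edges (s, t) to transition probabilities (missing edge = 0).
    """
    # forward[s] = probability of emitting the prefix read so far and being in s.
    forward = {SRC: 1.0}
    for a in word:
        nxt = {}
        for s, p in forward.items():
            for t, symbol in lab.items():
                if symbol == a:
                    q = p * alpha.get((s, t), 0.0)
                    if q > 0.0:
                        nxt[t] = nxt.get(t, 0.0) + q
        forward = nxt
    # A run is accepting only if it finally moves to sink.
    return sum(p * alpha.get((s, SINK), 0.0) for s, p in forward.items())

def sample_log_likelihood(sample, lab, alpha):
    """log P[S | P] for a bag of words, assuming independent draws.

    Raises ValueError if some word has probability zero.
    """
    return sum(math.log(word_probability(w, lab, alpha)) for w in sample)

# Illustrative process with two states labeled a (values chosen by hand).
lab = {"a1": "a", "a2": "a"}
alpha = {(SRC, "a1"): 1.0,
         ("a1", "a1"): 0.25, ("a1", "a2"): 0.25, ("a1", SINK): 0.5,
         ("a2", "a1"): 0.5, ("a2", SINK): 0.5}
```

In this toy process the word "aa" has two accepting runs, src a1 a1 sink and src a1 a2 sink, each with probability 0.125, so its total probability is 0.25. Working with logarithms for the sample avoids multiplying many small numbers.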

To learn a deterministic k-OA for S we therefore first try to infer from S the
probability distribution $\alpha$ that maximizes $P[S \mid \mathcal{P}]$, and use this distribution to
determine the topology of the desired deterministic k-OA. In particular, we remove
from $C_k$ the non-deterministic edges with the lowest probability as these are the
least likely to contribute to the generation of S, and are therefore the least likely
to be necessary for the acceptance of S.

The problem of inferring $\alpha$ from S is well-studied in Machine Learning, where
our stochastic process $\mathcal{P}$ corresponds to a particular kind of Hidden Markov Model
sometimes referred to as a Partially Observable Markov Model (POMM for short).
(For the readers familiar with Hidden Markov Models we note that the initial
state distribution $\pi$ usually considered in Hidden Markov Models is absorbed in
the state transition distribution $\alpha(\mathit{src}, \cdot)$ in our context.) Inference of $\alpha$ is generally
accomplished by the well-known Baum-Welch algorithm [Rabiner 1989] that adjusts
initial values for $\alpha$ until a (possibly local) maximum is reached.

Algorithm 1 iKOA
Require: a sample S, a value for k
Ensure: a deterministic k-OA G with $S \subseteq \mathcal{L}(G)$
 1: $\mathcal{P} \gets \mathrm{init}(k, S)$
 2: $\mathcal{P} \gets$ BaumWelsh($\mathcal{P}$, S)
 3: $G \gets$ Disambiguate($\mathcal{P}$, S)
 4: $G \gets$ Prune(G, S)
 5: return G

Algorithm 2 Disambiguate
Require: a POMM $\mathcal{P} = (G, \alpha)$ and sample S
Ensure: a deterministic k-OA
 1: Initialize queue Q to $\{s \in \mathrm{Succ}(\mathit{src}) \mid \alpha(\mathit{src}, s) > 0\}$
 2: Initialize set of marked states $D \gets \emptyset$
 3: while Q is non-empty do
 4:     $s \gets \mathrm{first}(Q)$
 5:     while some $a \in \Sigma$ has $|\mathrm{Succ}(s, a)| > 1$ do
 6:         pick $t \in \mathrm{Succ}(s, a)$ with $\alpha(s, t) = \max\{\alpha(s, t') \mid t' \in \mathrm{Succ}(s, a)\}$
 7:         set $\alpha(s, t) \gets \sum\{\alpha(s, t') \mid t' \in \mathrm{Succ}(s, a)\}$
 8:         for all $t'$ in $\mathrm{Succ}(s, a) \setminus \{t\}$ do
 9:             delete edge $(s, t')$ from G
10:             set $\alpha(s, t') \gets 0$
11:         $\mathcal{P} \gets$ BaumWelsh($\mathcal{P}$, S)
12:         if $S \not\subseteq \mathcal{L}(G)$ then Fail
13:     add s to marked states D and pop s from Q
14:     enqueue all states in $\mathrm{Succ}(s) \setminus D$ to Q
15: return G

We use Baum-Welch in our learning algorithm iKOA shown in Algorithm 1, which
operates as follows. In line 1, iKOA initializes the stochastic process $\mathcal{P}$ to the tuple
$(C_k, \alpha)$ where

— $C_k$ is the complete k-OA over $\Sigma(S)$;

— $\alpha(\mathit{src}, \mathit{sink})$ is the fraction of empty words in S;

— $\alpha(\mathit{src}, s)$ is the fraction of words in S that start with $\mathit{lab}(s)$, for every
$s \in \mathrm{Succ}(\mathit{src})$; and

— $\alpha(s, t)$ is chosen randomly for $s \ne \mathit{src}$, subject to the constraints in equation (1).

It is important to emphasize that, since we are trying to model a stochastic process,
multiple occurrences of the same word in S are important. A sample should therefore
not be considered as a set in Algorithm 1, but as a bag. Line 2 then optimizes
the initial values of $\alpha$ using the Baum-Welch algorithm.
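The initialization in line 1 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: words are Python strings whose characters are the alphabet symbols, states are encoded as pairs `(symbol, copy)`, copy 1 is taken to be the state reached from src, and the names `init_alpha` and `seed` are invented for the example. The sample is a list, so repeated words count with their multiplicity, as the bag semantics requires.

```python
import random
from collections import Counter

SRC, SINK = "src", "sink"

def init_alpha(sample, k, seed=None):
    """Initial transition probabilities on the complete k-OA for a bag `sample`."""
    rng = random.Random(seed)
    alphabet = sorted({a for w in sample for a in w})
    states = [(a, i) for a in alphabet for i in range(1, k + 1)]
    total = len(sample)
    first_symbol = Counter(w[0] for w in sample if w)

    # Row of src: observed fraction of empty words and of each first symbol.
    alpha = {(SRC, SINK): sum(1 for w in sample if not w) / total}
    for a in alphabet:
        alpha[(SRC, (a, 1))] = first_symbol[a] / total

    # Every other row: random weights, normalized so that the row sums to 1.
    targets = states + [SINK]
    for s in states:
        weights = [rng.random() for _ in targets]
        norm = sum(weights)
        for t, w in zip(targets, weights):
            alpha[(s, t)] = w / norm
    return alpha

sample = ["", "ab", "ab", "b"]          # a bag: "ab" occurs twice
alpha0 = init_alpha(sample, k=2, seed=0)
```

Different seeds give different starting points for the subsequent re-estimation, which matters because that re-estimation only finds a local maximum.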

With these probabilities in hand, Disambiguate, shown in Algorithm 2, determines
the topology of the desired deterministic k-OA for S. In a breadth-first
manner, it picks for each state s and each symbol a the state $t \in \mathrm{Succ}(s, a)$ with
the highest probability and deletes all other edges to states labeled by a. Line 7
merely ensures that $\alpha$ continues to be a probability distribution after this removal
and line 11 adjusts $\alpha$ to the new topology. Line 12 is a sanity check that ensures
that we have not removed edges necessary to accept all words in S; Disambiguate
reports failure otherwise. The result of a successful run of Disambiguate is a
deterministic k-OA which nevertheless may have edges (s, t) for which there is no
witness in S (i.e., a word in S whose unique accepting run traverses (s, t)). The
function Prune in line 4 of iKOA removes all such edges. It also removes all states
$s \in \mathrm{Succ}(\mathit{src})$ without a witness in S. Figure 3 illustrates a hypothetical run of
iKOA.
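The core of one disambiguation step (lines 5 to 10 of Algorithm 2) can be sketched for a single state. The sketch is a simplification: the function name `disambiguate_state` and the dictionary encoding of $\alpha$ are assumptions, the keys of `alpha` are taken to be the current edge set, and the re-estimation and the sanity check of lines 11 and 12 are deliberately left out.

```python
def disambiguate_state(s, lab, alpha):
    """Make state `s` deterministic in place and return the deleted edges.

    lab   maps inner states to symbols;
    alpha maps edges (s, t) to probabilities; its keys are the current edges.
    """
    # Group the labeled successors of s by symbol: Succ(s, a) for every a.
    by_symbol = {}
    for (u, t) in alpha:
        if u == s and t in lab:
            by_symbol.setdefault(lab[t], []).append(t)

    deleted = []
    for a, targets in by_symbol.items():
        if len(targets) <= 1:
            continue                      # already deterministic for symbol a
        best = max(targets, key=lambda t: alpha[(s, t)])
        # Line 7: the surviving edge inherits the total mass for symbol a,
        # so the row of s still sums to 1.
        alpha[(s, best)] = sum(alpha[(s, t)] for t in targets)
        # Lines 8-10: drop the competing edges.
        for t in targets:
            if t != best:
                del alpha[(s, t)]
                deleted.append((s, t))
    return deleted

# Hand-picked probabilities for the outgoing edges of a state a1.
lab = {"a1": "a", "a2": "a", "b1": "b", "b2": "b"}
alpha = {("a1", "a1"): 0.1, ("a1", "a2"): 0.4,
         ("a1", "b1"): 0.3, ("a1", "b2"): 0.1, ("a1", "sink"): 0.1}
removed = disambiguate_state("a1", lab, alpha)
```

With these values the edges to a1 and b2 are deleted, the edge to a2 ends up with probability 0.5, and the edge to b1 with probability 0.4.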

It should be noted that BaumWelsh, which iteratively refines $\alpha$ until a (possibly
local) maximum is reached, is computationally quite expensive. For that
reason, our implementation only executes a fixed number of refinement iterations
of BaumWelsh in Line 11. Rather surprisingly, this cut-off actually improves the
precision of iDRegEx, as our experiments in Section 5 show, where it is discussed
in more detail.

4.2 Translating k-OAs into k-OREs

Once we have learned a deterministic k-OA for a given sample S using iKOA,
it remains to translate this k-OA into a deterministic k-ORE. An obvious
approach in this respect would be to use the classical state elimination algorithm
(cf., e.g., [Hopcroft and Ullman 2007]). Unfortunately, as already hinted at by
Fernau [2004; 2005] and as we illustrate below, it is very difficult to get concise
regular expressions from an automaton representation. For instance, the classical
state elimination algorithm applied to the SOA in Figure 4 yields the expression:¹

(aa*d + (c + aa*c)(c + aa*c)*(d + aa* d) + (b + aa*b + (c + 
aa*c)(c + aa*c)*(b + aa*b))(aa*b + (c + aa*c)(c + aa*c)* 
(b + aa*b))*(aa*d + (c + aa*c)(c + aa*c)*(d + aa*d)))(aa*d + 
(c + aa*c)(c + aa*c)*(d + aa*d) + (b + aa*b + (c + aa*c)(c + 
aa*c)*(b + aa*b))(aa*b + (c + aa*c)(c + aa*c)*(b + aa*b))* 

which is non-deterministic and differs quite a bit from the equivalent deterministic 
SORE 

((b?(a + c))+d)+e. 

Actually, results by Ehrenfeucht and Zeiger [1976]; Gelade and Neven [2008]; and
Gruber and Holzer [2008] show that it is impossible in general to generate concise
regular expressions from automata: there are k-OAs (even for $k = 1$) for which the
number of occurrences of alphabet symbols in the smallest equivalent expression is
exponential in the size of the automaton. For such automata, an equivalent k-ORE
hence does not exist.

It is then natural to ask whether there is an algorithm that translates a given
k-OA into an equivalent k-ORE when such a k-ORE exists, and returns a k-ORE
super approximation of the input k-OA otherwise. Clearly, the above example
shows that the classical state elimination algorithm does not suffice for this purpose.



1 Transformation computed by JFLAP: www.jflap.org. 
Fig. 3. Example run of iKOA for $k = 2$ with target language $aa?b^+$. For the process
$\mathcal{P}$ in (c)-(f), the $\alpha$ values are listed in table-form. To distinguish different states
with the same label, we have indexed the labels. Panels: (c) Process $\mathcal{P}$ after first
disambiguation step (for $a_1$); edges to $a_1$ and $b_2$ are removed. (d) Process $\mathcal{P}$ after
second disambiguation step (for $b_1$); the removed edges include the one to $b_2$.
(e) Automaton A returned by Disambiguate. (f) Automaton A returned by Prune; it
accepts the same language as $aa?b^+$.

Fig. 4. A SOA on which the classical state elimination algorithm returns a complicated expression.

Fig. 5. An example marking.

For that reason, we have proposed in a companion article [Bex et al. ] a family 
of algorithms {RWR, RWR 2 , RWR^, rwr§, . . . } that translate SOAs into SOREs and 
have exactly these properties: 

Theorem 4.3 ([Bex et al. ]). Let G be a SOA and let T be any of the algo- 
rithms in the family {RWR, RWR 2 , RWr|, RWR 2 , . . . }. If G is equivalent to a SORE 
r, then T(G) returns a SORE equivalent to r. Otherwise, T(G) returns a SORE
that is a super approximation of G, i.e., $\mathcal{L}(G) \subseteq \mathcal{L}(T(G))$.

(Note that SOAs and SOREs are always deterministic by definition.) 

These algorithms, in short, apply an inverse Glushkov translation. Starting from
a k-OA where each state is labeled by a symbol, they iteratively rewrite subautomata
into equivalent regular expressions. In the end only one state remains and
the regular expression labeling this state is the output.

In this section, we show how the above algorithms can be used to translate k-OAs
into k-OREs. For simplicity of exposition, we will focus our discussion on rwr²₁ as
it is the concrete translation algorithm used in our experiments in Section 5, but
the same arguments apply to the other algorithms in the family.

Definition 4.4. First, let $\Sigma^{(k)}$ denote the alphabet that consists of k copies of
the symbols in $\Sigma$, where the first copy of $a \in \Sigma$ is denoted by $a^{(1)}$, the second by
$a^{(2)}$, and so on:

\[ \Sigma^{(k)} := \{a^{(i)} \mid a \in \Sigma,\ 1 \le i \le k\}. \]

Let strip be the function mapping copies to their original symbol, i.e., $\mathit{strip}(a^{(i)}) = a$.
We extend strip pointwise to words, languages, and regular expressions over $\Sigma^{(k)}$.

For example, $\mathit{strip}(\{a^{(1)} a^{(2)} b^{(1)},\ a^{(1)} a^{(2)} c^{(1)}\}) = \{aab, aac\}$ and
$\mathit{strip}(a^{(1)} \cdot a^{(2)}? \cdot b^{(1)+}) = a \cdot a? \cdot b^+$.

To see how we can use rwr²₁, which translates SOAs into SOREs, to translate
a k-OA into a k-ORE, observe that we can always transform a k-OA G over $\Sigma$
into a SOA H over $\Sigma^{(k)}$ by processing the nodes of G in an arbitrary order and
replacing the i-th occurrence of label $a \in \Sigma$ by $a^{(i)}$. To illustrate, the SOA over $\Sigma^{(2)}$
obtained in this way from the 2-OA in Figure 2(a) is shown in Figure 5. Clearly,
$\mathcal{L}(G) = \mathit{strip}(\mathcal{L}(H))$.

Definition 4.5. We call a SOA H over $\Sigma^{(k)}$ obtained from a k-OA G in the above
manner a marking of G. □
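Marking and stripping are simple relabelings, which the following sketch illustrates. The function names `mark`, `strip_symbol`, and `strip_word`, the state names, and the encoding of a copy $a^{(i)}$ as the pair `(a, i)` are assumptions made for this example only.

```python
from collections import defaultdict

def mark(lab, order=None):
    """Return a marking: state -> (symbol, i) for the i-th state with that symbol.

    `order` fixes the (arbitrary) order in which states are processed;
    by default the states are processed in sorted order.
    """
    seen = defaultdict(int)
    marked = {}
    for state in (order if order is not None else sorted(lab)):
        a = lab[state]
        seen[a] += 1
        marked[state] = (a, seen[a])
    return marked

def strip_symbol(marked_symbol):
    """strip(a^(i)) = a."""
    symbol, _copy = marked_symbol
    return symbol

def strip_word(marked_word):
    """Pointwise extension of strip to a word, given as a list of marked symbols."""
    return "".join(strip_symbol(m) for m in marked_word)

# Labels of an assumed 2-OA for aa?b+ with states s1, s2, s3.
lab = {"s1": "a", "s2": "a", "s3": "b"}
h = mark(lab)
```

Since every state of the marking carries a distinct label, the marked automaton is a SOA, and stripping its labels gives back the labels of the original k-OA.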

Algorithm 3 rwr²
Require: a k-OA G
Ensure: a k-ORE r with $\mathcal{L}(G) \subseteq \mathcal{L}(r)$
 1: compute a marking H of G.
 2: return strip(rwr²₁(H))

Note that, by Theorem 4.3, running rwr²₁ on H yields a SORE r over $\Sigma^{(k)}$
with $\mathcal{L}(H) \subseteq \mathcal{L}(r)$. For instance, with H as in Figure 5, rwr²₁(H) returns
$r = a^{(1)} \cdot a^{(2)}? \cdot b^{(1)+}$. By subsequently stripping r, we always obtain a k-ORE over $\Sigma$.
Moreover, $\mathcal{L}(G) = \mathit{strip}(\mathcal{L}(H)) \subseteq \mathit{strip}(\mathcal{L}(r)) = \mathcal{L}(\mathit{strip}(r))$, so the k-ORE strip(r)
is always a super approximation of G. Algorithm 3, called rwr², summarizes the
translation. By our discussion, rwr² is clearly sound:

PROPOSITION 4.6. rwr²(G) is a (possibly non-deterministic) k-ORE with
$\mathcal{L}(G) \subseteq \mathcal{L}(\text{rwr}^2(G))$, for every k-OA G.

Note, however, that even when G is deterministic and equivalent to a deterministic
k-ORE r, rwr²(G) need not be deterministic, nor equivalent to r. For instance,
consider the 2-OA G:



Clearly, G is equivalent to the deterministic 2-ORE $bc?a(ba)^+?$. Now suppose for
the purpose of illustration that rwr² constructs the following marking H of G. (It
does not matter which marking rwr² constructs, they all result in the same final
expression.)

(Figure: the marking H of G, whose states are labeled $b^{(1)}$, $c^{(1)}$, $a^{(1)}$, and $b^{(2)}$.)

Since H is not equivalent to a SORE over $\Sigma^{(k)}$, rwr²₁(H) need not be equivalent
to H. In fact, rwr²₁(H) returns $((b^{(1)} c^{(1)}? a^{(1)})? b^{(2)}?)^+$, which yields the
non-deterministic $((bc?a)?b?)^+$ after stripping. Nevertheless, G is equivalent to the
deterministic 2-ORE $bc?a(ba)^+?$.

So although rwr² is always guaranteed to return a k-ORE, it does not provide
the same strong guarantees that rwr²₁ provides (Theorem 4.3). The following theorem
shows, however, that if we can obtain G by applying the Glushkov construction
on r [Brüggemann-Klein 1993], rwr²(G) is always equivalent to r. Moreover, if r
is deterministic, then so is rwr²(G). So in this sense, rwr² applies an inverse
Glushkov construction to r. Formally, the Glushkov construction is defined as
follows.

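Definition 4.7. Let r be a k-ORE. Recall from Definition 1.2 that $\bar{r}$ is the regular
expression obtained from r by replacing the i-th occurrence of alphabet symbol a
by $a^{(i)}$, for every $a \in \Sigma$ and every $1 \le i \le k$. Let $\mathit{pos}(\bar{r})$ denote the symbols in $\Sigma^{(k)}$
that actually appear in $\bar{r}$. Moreover, let the sets $\mathit{first}(\bar{r})$, $\mathit{last}(\bar{r})$, and $\mathit{follow}(\bar{r}, a^{(i)})$
be defined as shown in Figure 6. A k-OA G is a Glushkov translation of r if there
exists a one-to-one onto mapping $\rho\colon (V(G) - \{\mathit{src}, \mathit{sink}\}) \to \mathit{pos}(\bar{r})$ such that

(1) $v \in \mathrm{Succ}(\mathit{src}) \Leftrightarrow \rho(v) \in \mathit{first}(\bar{r})$;

(2) $v \in \mathrm{Pred}(\mathit{sink}) \Leftrightarrow \rho(v) \in \mathit{last}(\bar{r})$;

(3) $v \in \mathrm{Succ}(w) \Leftrightarrow \rho(v) \in \mathit{follow}(\bar{r}, \rho(w))$; and

(4) $\mathit{strip}(\rho(v)) = \mathit{lab}(v)$,

for all $v, w \in V(G) - \{\mathit{src}, \mathit{sink}\}$. □

\[
\begin{aligned}
\mathit{first}(\emptyset) &= \mathit{first}(\varepsilon) = \emptyset, \qquad
\mathit{first}(a^{(i)}) = \{a^{(i)}\}, \\
\mathit{first}(\bar{r}?) &= \mathit{first}(\bar{r}^+) = \mathit{first}(\bar{r}), \qquad
\mathit{first}(\bar{r} + \bar{s}) = \mathit{first}(\bar{r}) \cup \mathit{first}(\bar{s}), \\
\mathit{first}(\bar{r} \cdot \bar{s}) &=
  \begin{cases}
    \mathit{first}(\bar{r}) & \text{if } \varepsilon \notin \mathcal{L}(\bar{r}), \\
    \mathit{first}(\bar{r}) \cup \mathit{first}(\bar{s}) & \text{otherwise,}
  \end{cases} \\[1ex]
\mathit{last}(\emptyset) &= \mathit{last}(\varepsilon) = \emptyset, \qquad
\mathit{last}(a^{(i)}) = \{a^{(i)}\}, \\
\mathit{last}(\bar{r}?) &= \mathit{last}(\bar{r}^+) = \mathit{last}(\bar{r}), \qquad
\mathit{last}(\bar{r} + \bar{s}) = \mathit{last}(\bar{r}) \cup \mathit{last}(\bar{s}), \\
\mathit{last}(\bar{r} \cdot \bar{s}) &=
  \begin{cases}
    \mathit{last}(\bar{s}) & \text{if } \varepsilon \notin \mathcal{L}(\bar{s}), \\
    \mathit{last}(\bar{r}) \cup \mathit{last}(\bar{s}) & \text{otherwise,}
  \end{cases} \\[1ex]
\mathit{follow}(a^{(i)}, a^{(i)}) &= \emptyset, \qquad
\mathit{follow}(\bar{r}?, a^{(i)}) = \mathit{follow}(\bar{r}, a^{(i)}), \\
\mathit{follow}(\bar{r}^+, a^{(i)}) &=
  \begin{cases}
    \mathit{follow}(\bar{r}, a^{(i)}) & \text{if } a^{(i)} \notin \mathit{last}(\bar{r}), \\
    \mathit{follow}(\bar{r}, a^{(i)}) \cup \mathit{first}(\bar{r}) & \text{otherwise,}
  \end{cases} \\
\mathit{follow}(\bar{r} + \bar{s}, a^{(i)}) &=
  \begin{cases}
    \mathit{follow}(\bar{r}, a^{(i)}) & \text{if } a^{(i)} \in \mathit{pos}(\bar{r}), \\
    \mathit{follow}(\bar{s}, a^{(i)}) & \text{otherwise,}
  \end{cases} \\
\mathit{follow}(\bar{r} \cdot \bar{s}, a^{(i)}) &=
  \begin{cases}
    \mathit{follow}(\bar{r}, a^{(i)}) & \text{if } a^{(i)} \in \mathit{pos}(\bar{r}),\ a^{(i)} \notin \mathit{last}(\bar{r}), \\
    \mathit{follow}(\bar{r}, a^{(i)}) \cup \mathit{first}(\bar{s}) & \text{if } a^{(i)} \in \mathit{pos}(\bar{r}),\ a^{(i)} \in \mathit{last}(\bar{r}), \\
    \mathit{follow}(\bar{s}, a^{(i)}) & \text{otherwise.}
  \end{cases}
\end{aligned}
\]

Fig. 6. Definition of $\mathit{first}(\bar{r})$, $\mathit{last}(\bar{r})$, and $\mathit{follow}(\bar{r}, a^{(i)})$, for $a^{(i)} \in \mathit{pos}(\bar{r})$.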
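The sets first, last, and follow can be computed by a direct recursion over the marked expression. The sketch below is an illustration only: the tuple-based syntax tree and the helper names `sym`, `cat`, `alt`, `opt`, `plus`, and `nullable` are assumptions of this example, the base cases for the empty set and the empty word are omitted, and follow is returned as one dictionary for all positions instead of one set per call.

```python
def sym(a, i): return ("sym", (a, i))   # the marked symbol a^(i)
def cat(r, s): return ("cat", r, s)     # r . s
def alt(r, s): return ("alt", r, s)     # r + s
def opt(r):    return ("opt", r)        # r?
def plus(r):   return ("plus", r)       # one-or-more repetitions of r

def nullable(e):
    """True iff the empty word belongs to the language of e."""
    op = e[0]
    if op == "sym":
        return False
    if op == "opt":
        return True
    if op == "plus":
        return nullable(e[1])
    if op == "cat":
        return nullable(e[1]) and nullable(e[2])
    return nullable(e[1]) or nullable(e[2])          # "alt"

def first(e):
    op = e[0]
    if op == "sym":
        return {e[1]}
    if op in ("opt", "plus"):
        return first(e[1])
    if op == "alt":
        return first(e[1]) | first(e[2])
    # "cat": the right part contributes only if the left part can be empty.
    return first(e[1]) | first(e[2]) if nullable(e[1]) else first(e[1])

def last(e):
    op = e[0]
    if op == "sym":
        return {e[1]}
    if op in ("opt", "plus"):
        return last(e[1])
    if op == "alt":
        return last(e[1]) | last(e[2])
    # "cat": the left part contributes only if the right part can be empty.
    return last(e[1]) | last(e[2]) if nullable(e[2]) else last(e[2])

def follow(e, out=None):
    """Map every position of e to the set of positions that may follow it."""
    if out is None:
        out = {}
    op = e[0]
    if op == "sym":
        out.setdefault(e[1], set())
    elif op == "opt":
        follow(e[1], out)
    elif op == "plus":
        follow(e[1], out)
        for p in last(e[1]):              # repetition: last loops back to first
            out[p] |= first(e[1])
    else:                                 # "alt" or "cat"
        follow(e[1], out)
        follow(e[2], out)
        if op == "cat":
            for p in last(e[1]):          # concatenation: last(r) precedes first(s)
                out[p] |= first(e[2])
    return out

# Marked version of a . a? . b+ (indices as in the marking of Figure 5).
example = cat(cat(sym("a", 1), opt(sym("a", 2))), plus(sym("b", 1)))
```

A Glushkov translation then has one state per position, an edge from src to every position in first, an edge to sink from every position in last, and an edge from w to v whenever the position of v is in the follow set of the position of w.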



Theorem 4.8. If a k-OA G is a Glushkov representation of a target k-ORE
r, then rwr²(G) is equivalent to r. Moreover, if r is deterministic, then so is
rwr²(G).

PROOF. Since rwr²(G) = strip(rwr²₁(H)) for an arbitrarily chosen marking
H of G, it suffices to prove that strip(rwr²₁(H)) is equivalent to r and that
strip(rwr²₁(H)) is deterministic whenever r is deterministic, for every marking H
of G. Hereto, let H be an arbitrary but fixed marking of G. In particular, G and H
have the same set of nodes V and edges E, but differ in their labeling function. Let
$\mathit{lab}_G$ be the labeling function of G and let $\mathit{lab}_H$ be the labeling function of H. Clearly,
$\mathit{lab}_G(v) = \mathit{strip}(\mathit{lab}_H(v))$ for every $v \in V - \{\mathit{src}, \mathit{sink}\}$. Since G is a Glushkov
translation of r, there is a one-to-one, onto mapping $\rho\colon (V - \{\mathit{src}, \mathit{sink}\}) \to \mathit{pos}(\bar{r})$
satisfying properties (1)-(4) in Definition 4.7. Now let $\sigma\colon \mathit{pos}(\bar{r}) \to \Sigma^{(k)}$ be the
function that maps $a^{(i)} \in \mathit{pos}(\bar{r})$ to $\mathit{lab}_H(\rho^{-1}(a^{(i)}))$. Since $\mathit{lab}_H$ assigns a distinct
label to each state, $\sigma$ is one-to-one and onto the subset of $\Sigma^{(k)}$ symbols used as
labels in H. Moreover, by property (4) and the fact that $\mathit{lab}_G(v) = \mathit{strip}(\mathit{lab}_H(v))$
we have,

\[ \mathit{strip}(a^{(i)}) = \mathit{lab}_G(\rho^{-1}(a^{(i)})) = \mathit{strip}(\mathit{lab}_H(\rho^{-1}(a^{(i)}))) = \mathit{strip}(\sigma(a^{(i)})) \tag{$\star$} \]

for each $a^{(i)} \in \mathit{pos}(\bar{r})$. In other words, $\sigma$ preserves (stripped) labels. Now let $\sigma(\bar{r})$
be the SORE obtained from $\bar{r}$ by replacing each $a^{(i)} \in \mathit{pos}(\bar{r})$ by $\sigma(a^{(i)})$. Since $\sigma$ is
one-to-one and $\bar{r}$ is a SORE, so is $\sigma(\bar{r})$. Moreover, we claim that $\mathcal{L}(H) = \mathcal{L}(\sigma(\bar{r}))$.

Indeed, it is readily verified by induction on $\bar{r}$ that a word $a_1^{(i_1)} \ldots a_n^{(i_n)} \in \mathcal{L}(\bar{r})$
if, and only if, (i) $a_1^{(i_1)} \in \mathit{first}(\bar{r})$; (ii) $a_{p+1}^{(i_{p+1})} \in \mathit{follow}(\bar{r}, a_p^{(i_p)})$ for every
$1 \le p < n$; and (iii) $a_n^{(i_n)} \in \mathit{last}(\bar{r})$. By properties (1)-(4) of Definition 4.7 we
hence obtain:

\[
\begin{aligned}
& \sigma(a_1^{(i_1)}) \ldots \sigma(a_n^{(i_n)}) \in \mathcal{L}(\sigma(\bar{r})) \\
\Leftrightarrow\ & a_1^{(i_1)} \ldots a_n^{(i_n)} \in \mathcal{L}(\bar{r}) \\
\Leftrightarrow\ & \mathit{src}, \rho^{-1}(a_1^{(i_1)}), \ldots, \rho^{-1}(a_n^{(i_n)}), \mathit{sink} \text{ is a walk in } G \\
\Leftrightarrow\ & \mathit{src}, \rho^{-1}(a_1^{(i_1)}), \ldots, \rho^{-1}(a_n^{(i_n)}), \mathit{sink} \text{ is a walk in } H \\
\Leftrightarrow\ & \mathit{lab}_H(\rho^{-1}(a_1^{(i_1)})) \ldots \mathit{lab}_H(\rho^{-1}(a_n^{(i_n)})) \in \mathcal{L}(H) \\
\Leftrightarrow\ & \sigma(a_1^{(i_1)}) \ldots \sigma(a_n^{(i_n)}) \in \mathcal{L}(H)
\end{aligned}
\]

Therefore, $\mathcal{L}(H) = \mathcal{L}(\sigma(\bar{r}))$.

Hence, we have established that H is a SOA over $\Sigma^{(k)}$ equivalent to the SORE
$\sigma(\bar{r})$ over $\Sigma^{(k)}$. By Theorem 4.3, rwr²₁(H) is hence equivalent to $\sigma(\bar{r})$. Therefore,
strip(rwr²₁(H)) is equivalent to $\mathit{strip}(\sigma(\bar{r}))$, which by ($\star$) above, is equivalent to
$\mathit{strip}(\bar{r}) = r$, as desired.

Finally, to see that strip(rwr²₁(H)) is deterministic if r is deterministic, let
s := strip(rwr²₁(H)) and suppose for the purpose of contradiction that s is not
deterministic. Then there exist $w a^{(i)} v_1$ and $w a^{(j)} v_2$ in $\mathcal{L}(\bar{s})$ with $i \ne j$. It is
not hard to see that this can happen only if there exist $w' a^{(i')} v'_1$ and $w' a^{(j')} v'_2$
in $\mathcal{L}(\text{rwr}^2_1(H))$ with $i' \ne j'$. Since $\mathcal{L}(\text{rwr}^2_1(H)) = \mathcal{L}(\sigma(\bar{r}))$ we know that hence
$\sigma^{-1}(w' a^{(i')} v'_1) \in \mathcal{L}(\bar{r})$ and $\sigma^{-1}(w' a^{(j')} v'_2) \in \mathcal{L}(\bar{r})$. Let $w'' a^{(i'')} v''_1 = \sigma^{-1}(w' a^{(i')} v'_1)$
and $w'' a^{(j'')} v''_2 = \sigma^{-1}(w' a^{(j')} v'_2)$. Since $\sigma$ is one-to-one and $i' \ne j'$, also $i'' \ne j''$.
Therefore, r is not deterministic, which yields the desired contradiction. □

4.3 The whole Algorithm 

Our deterministic regular expression inference algorithm iDRegEx combines iKOA
and rwr² as shown in Algorithm 4. For increasing values of k until a maximum
$k_{\max}$ is reached, it first learns a deterministic k-OA G from the given sample S,
and subsequently translates that k-OA into a k-ORE using rwr². If the resulting
k-ORE is deterministic then it is added to the set C of deterministic candidate
expressions for S, otherwise it is discarded. From this set of candidate expressions,
iDRegEx returns the "best" regular expression best(C), which is determined
according to one of the measures introduced below. Since it is well-known that,
depending on the initial value of $\alpha$, BaumWelsh (and therefore iKOA) may
converge to a local maximum that is not necessarily global, we apply iKOA a number
of times N with independently chosen random seed values for $\alpha$ to increase the
probability of correctly learning the target regular expression from S.

Algorithm 4 iDRegEx
Require: a sample S
Ensure: a k-ORE r
 1: initialize candidate set $C \gets \emptyset$
 2: for $k = 1$ to $k_{\max}$ do
 3:     for $n = 1$ to N do
 4:         $G \gets$ iKOA(S, k)
 5:         if rwr²(G) is deterministic then
 6:             add rwr²(G) to C
 7: return best(C)

The observant reader may wonder whether we are always guaranteed to derive
at least one deterministic expression such that best(C) is defined. Indeed,
Theorem 4.8 tells us that if we manage to learn from sample S a k-OA which is the
Glushkov representation of the target expression r, then rwr² will always return
a deterministic k-ORE equivalent to r. When $k > 1$, there can be several k-OAs
representing the same language and we could therefore learn a non-Glushkov one.
In that case, rwr² always returns a k-ORE which is a super approximation of the
target expression. Although that approximation can be non-deterministic, since we
derive k-OREs for increasing values of k and since for $k = 1$ the result of rwr² is
always deterministic (as every SORE is deterministic), we always infer at least one
deterministic regular expression. In fact, in our experiments on 100 synthetic
regular expressions, we derived for 96 of them a deterministic expression with $k > 1$,
and only for 4 expressions had to resort to a 1-ORE approximation.

4.3.1 A Language Size Measure for Determining the Best Candidate. Intuitively,
we want to select from C the simplest deterministic expression that "best" describes
S. Since each candidate expression in C accepts all words in S by construction, one
way to interpret "the best" is to select the expression that accepts the least number
of words (thereby adding the least number of words to S). Since an expression
defines an infinite language in general, it is of course impossible to take all words into
account. We therefore only consider the words up to a length n, where $n = 2m + 1$
with m the length of the candidate expression, excluding regular expression
operators, $\emptyset$, and $\varepsilon$. For instance, if the candidate expression is $a \cdot (a + c^+)?$, then $m = 3$
and $n = 7$. Formally, for a language L, let $|L^{\le n}|$ denote the number of words in L
of length at most n. Then the best candidate in C is the one with the least value of
$|\mathcal{L}(r)^{\le n}|$. If there are multiple such candidates, we pick the shortest one (breaking
ties arbitrarily). It turns out that $|\mathcal{L}(r)^{\le n}|$ can be computed quite efficiently; see
[Bex et al. ] for details.
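To make the measure concrete, the sketch below counts the words of length at most n by brute-force enumeration. It is not the efficient procedure referred to above, and it is exponential in n. The function name `language_size` is an assumption, and candidate expressions are written in Python's `re` syntax, so the example expression $a \cdot (a + c^+)?$ becomes `a(a|c+)?`.

```python
import re
from itertools import product

def language_size(pattern, alphabet, n):
    """Count the words of length at most n over `alphabet` that match `pattern`.

    `pattern` uses Python `re` syntax: `|` for disjunction and a postfix `+`
    for one-or-more. The cost grows exponentially with n.
    """
    regex = re.compile(pattern)
    return sum(1
               for length in range(n + 1)
               for letters in product(sorted(alphabet), repeat=length)
               if regex.fullmatch("".join(letters)))

m = 3                 # symbol occurrences in a(a|c+)?: a, a, c
n = 2 * m + 1         # words up to length 7 are considered
tight = language_size("a(a|c+)?", "ac", n)
loose = language_size("a(a|c)*", "ac", n)
```

Up to length 7 the expression `a(a|c+)?` accepts a, aa, and a followed by one to six c's, that is 8 words, whereas the more general `a(a|c)*` accepts 127. Under this measure the first expression would be preferred if both were candidates.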

4.3.2 A Minimum Description Length Measure for Determining the Best Candidate.
An alternative measure to determine the best candidate is given by Adriaans
and Vitanyi [2006], who compare the size of S with the size of the language of a
candidate r. Specifically, Adriaans and Vitanyi define the data encoding cost of r
to be:

\[ \mathit{datacost}(r, S) := \sum_{i=0}^{n} \left( 2 \cdot \log_2 i + \log_2 \binom{|\mathcal{L}^{=i}(r)|}{|S^{=i}|} \right) \]

where $n = 2m + 1$ as before; $|S^{=i}|$ is the number of words in S that have length i;
and $|\mathcal{L}^{=i}(r)|$ is the number of words in $\mathcal{L}(r)$ that have exactly length i. Although
the above formula is numerically difficult to compute, there is an easier estimation
procedure; see [Adriaans and Vitanyi 2006] for details.

In this case, the model encoding cost is simply taken to be its length, thereby 
preferring shorter expressions over longer ones. The best regular expression in the 
candidate set C is then the one that minimizes both model and data encoding cost 
(breaking ties arbitrarily). 
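For small cases the data encoding cost can be evaluated directly. The sketch below is a literal reading of the formula under stated assumptions, not the estimation procedure of Adriaans and Vitanyi: the function name `data_cost` is invented, the language counts per length are passed in as a dictionary, only distinct sample words are counted, and the term $2 \cdot \log_2 i$ is skipped for $i = 0$, where the logarithm is undefined.

```python
import math
from collections import Counter

def data_cost(sample, lang_counts, n):
    """Data encoding cost of a candidate, summed over word lengths 0..n.

    lang_counts[i] is the number of words of length exactly i in the language
    of the candidate. Assumes the candidate accepts every word in `sample`.
    """
    per_length = Counter(len(w) for w in set(sample))   # distinct words only
    cost = 0.0
    for i in range(n + 1):
        if i > 0:                       # assumption: skip log2(0) at i = 0
            cost += 2 * math.log2(i)
        cost += math.log2(math.comb(lang_counts.get(i, 0), per_length.get(i, 0)))
    return cost

sample = ["a", "aa", "ac", "acc"]
# Words per length up to n = 7 for a(a|c+)? and for a(a|c)*, counted by hand.
tight_counts = {1: 1, 2: 2, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1}
loose_counts = {i: 2 ** (i - 1) for i in range(1, 8)}
tight_cost = data_cost(sample, tight_counts, 7)
loose_cost = data_cost(sample, loose_counts, 7)
```

For this sample the tighter expression incurs only the length terms, $2 \cdot \log_2(7!)$, while the looser one pays 2 extra bits because the single sample word of length 3 must be singled out among 4 candidates. `math.comb` requires Python 3.8 or later.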

We already mentioned that xtract [Garofalakis et al. 2003] also utilizes the 
Minimum Description Length principle. However, their measure for data encoding 
cost depends on the concrete structure of the regular expressions while ours only 
depends on the language defined by them and is independent of the representation. 
Therefore, in our setting, when two equivalent expressions are derived, the one with 
the smallest model cost, that is, the simplest one, will always be taken. 

5. EXPERIMENTS 

In this section we validate our approach by means of an experimental analysis.
Throughout the section, we say that a target k-ORE r is successfully derived when
a k-ORE s with $\mathcal{L}(r) = \mathcal{L}(s)$ is generated. The success rate of our experiments
then is the percentage of successfully derived target regular expressions.

Our previous work [Bex et al. 2008] on this topic was based on a version of the
rwr⁰ algorithm [Bex et al. 2006]; we refer to this algorithm as iDRegEx(rwr⁰).
Unfortunately, as detailed in [Bex et al. 2008], it is not known whether rwr⁰ is
complete on the class of all single occurrence regular expressions. Nevertheless, the
experiments in [Bex et al. 2008], which are revisited below, show a good and reliable
performance. However, to obtain a theoretically complete algorithm, cf. Theorem 4.8,
we use the algorithm rwr²₁, which is sound and complete on single occurrence
regular expressions. In the remainder we focus on iDRegEx, but compare
with the results for iDRegEx(rwr⁰).

As mentioned in Section 4.3.1, another new aspect of the results presented here is
the use of language size as an alternative measure over Minimum Description Length
(MDL) to compare candidates. The iDRegEx(rwr⁰) algorithm is only considered
with the MDL criterion. We note that for alphabet size 5, the success rate of
iDRegEx with the MDL criterion was only 21 %, while that of the language size
criterion is 98 %. The corpus used in this experiment is described in Section 5.3.
Therefore, in the remainder of this section we only consider iDRegEx with the
language size criterion.

For all the experiments described below we take $k_{\max} = 4$ and $N = 10$ in
Algorithm 4.

5.1 Running times 

All experiments were performed using a prototype implementation of iDRegEx
and iDRegEx(rwr⁰) written in Java executed on Pentium M 2.0 GHz class
machines equipped with 1GB RAM. For the BaumWelsh subroutine we have gratefully
used Jean-Marc François' Jahmm library [François 2006], which is a faithful
implementation of the algorithms described in Rabiner's Hidden Markov Model
tutorial [Rabiner 1989]. Since Jahmm strives for clarity rather than performance and
since only limited precautions are taken against underflows, our prototype should
be seen as a proof of concept rather than a polished product. In particular,
underflows currently limit us to target regular expressions whose total number of symbol
occurrences is at most 40. Here, the total number of symbol occurrences occ(r) of
a regular expression r is its length excluding the regular expression operators and
parentheses. To illustrate, the total number of symbol occurrences in $aa?b^+$ is 3.
Furthermore, the lack of optimization in Jahmm leads to average running times
ranging from 4 minutes for target expressions r with $|\Sigma(r)| = 5$ and occ(r) = 6 to
9 hours for target expressions with $|\Sigma(r)| = 15$ and occ(r) = 30. Running times for
iDRegEx and iDRegEx(rwr⁰) are similar.

As already mentioned in Section 4.1, one of the bottlenecks of iDRegEx is the
application of BaumWelsh in Line 11 of Disambiguate (Algorithm 2). BaumWelsh
is an iterative procedure that is typically run until convergence, i.e., until the
computed probability distribution no longer changes significantly. To improve the
running time, we only apply a fixed number $\ell$ of iteration steps when calling
BaumWelsh in Line 11 of Disambiguate. Experiments show that the running
time scales linearly with $\ell$, as one expects, but, perhaps surprisingly, the
success rate improves as well for an optimal value of $\ell$. This optimal value for $\ell$
depends on the alphabet size. These improved results can be explained as follows:
applying BaumWelsh in each disambiguation step until it converges guarantees
that the probability distribution for that step will have reached a local optimum.
However, we know that the search space for the algorithm contains many local
optima, and that BaumWelsh is a local optimization algorithm, i.e., it will converge
to one of the local optima it can reach from its starting point by hill climbing. The
disambiguation procedure proceeds state by state, so fine tuning the probability
distribution for a disambiguation step may transform the search space so that
certain local optima for the next iteration can no longer be reached by a local search
algorithm such as BaumWelsh. Table I shows the performance of the algorithm
for various numbers of BaumWelsh iterations $\ell$ for expressions of alphabet size 5,
10 and 15. These expressions are those described in Section 5.3. In this table,
$\ell = \infty$ denotes the case where BaumWelsh is run until convergence after each
disambiguation step. The table illustrates that the success rate is actually higher
for small values of $\ell$. The running time performance gains increase rapidly with
the expressions' alphabet size: for $|\Sigma| = 5$, we gain a factor of 3.5 ($\ell = 2$), for
$|\Sigma| = 10$, it is already a factor of 10 ($\ell = 3$) and for $|\Sigma| = 15$, we gain a factor
of 25 ($\ell = 3$). This brings the running time for the largest expressions we tested
down to 22 minutes, in contrast with the 9 hours mentioned for iDRegEx(rwr⁰) and
iDRegEx. The algorithm with the optimal number of BaumWelsh steps in the
disambiguation process will be referred to as iDRegEx$^{\text{fixed}}$. In particular, for small
alphabet sizes ($|\Sigma| \le 7$) we use $\ell = 2$, and for large alphabet sizes ($|\Sigma| > 7$) we use
$\ell = 3$. We note that the alphabet size can easily be determined from the sample.

We should also note that experience with Hidden Markov Model learning in
bioinformatics [Finn et al. 2006] suggests that both the running time and the maximum
number of symbol occurrences that can be handled can be significantly improved
by moving to an industrial-strength BaumWelsh implementation. Our focus for
the rest of the section will therefore be on the precision of iDRegEx.
  $\ell$        rate $|\Sigma| = 5$    rate $|\Sigma| = 10$    rate $|\Sigma| = 15$
  1             95 %                   80 %                    40 %
  2            100 %                   75 %                    50 %
  3             95 %                   84 %                    60 %
  4             95 %                   77 %                    50 %
  $\infty$      98 %                   75 %                    50 %

Table I. Success rate for a limited number of BaumWelsh iterations in the disambiguation
procedure; $\ell = \infty$ corresponds to iDRegEx, and $\ell = 1, \ldots, 4$ correspond to iDRegEx$^{\text{fixed}}$.

5.2 Real-world target expressions and real-world samples 

We want to test how iDRegEx performs on real-world data. Since the number
of publicly available XML corpora with valid schemas is rather limited, we have
used as target expressions the 49 content models occurring in the XSD for XML
Schema Definitions [Thompson et al. 2001] and have drawn multiset samples for
these expressions from a large corpus of real-world XSDs harvested from the Cover
Pages [Cover 2003]. In other words, the goal of our first experiment is to derive, from
a corpus of XSD definitions, the regular expression content models in the schema
for XML Schema Definitions.² As it turns out, the XSD regular expressions are all
single occurrence regular expressions.

The iDRegEx(rwr⁰) algorithm infers all these expressions correctly, showing
that it is conservative with respect to k since, as mentioned above, the algorithm
considers k values ranging from 1 to 4. In this setting, iDRegEx does not perform
as well, deriving only 73 % of the regular expressions correctly. We note that for
each expression that was not derived exactly, an expression was always obtained
that describes the input sample and is, in addition, more specific than the target
expression. iDRegEx therefore seems to favor more specific regular expressions,
based on the available examples.

5.3 Synthetic target expressions 

Although the successful inference of the real-world expressions in Section 5.2
suggests that iDRegEx is applicable in real-world scenarios, we further test its
behavior on a sizable and diverse set of regular expressions. Due to the lack of real-world
data, we have developed a synthetic regular expression generator that is
parameterized for flexibility.

Synthetic expression generation. In particular, the occurrence of the regular
expression operators concatenation, disjunction (+), zero-or-one (?), zero-or-more
(*), and one-or-more (^+) in the generated expressions is determined by a
user-defined probability distribution. We found that typical values yielding realistic
expressions are 1/10 for each of the unary operators and 7/20 for each of the binary
ones. The alphabet can be specified, as well as the number of times that each
individual symbol should occur. The maximum of these numbers determines the value k
of the generated k-ORE.
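As an illustration only, the following Python sketch shows one way such a generator could be organized. The text fixes only the operator probabilities and the per-symbol occurrence counts; the bottom-up construction, the nested-tuple representation, and the names `generate_expression`, `leaves`, and `show` are assumptions made for this example. Simplification and the determinism check are not covered.

```python
import random

# Operator probabilities quoted in the text: 1/10 for each unary operator
# and 7/20 for each binary operator (together they sum to 1).
OPERATOR_WEIGHTS = {"concat": 7 / 20, "or": 7 / 20, "?": 1 / 10, "*": 1 / 10, "+": 1 / 10}

def generate_expression(symbol_counts, rng, weights=OPERATOR_WEIGHTS):
    """Build a random expression tree (hypothetical procedure, not the authors').

    symbol_counts maps each alphabet symbol to the number of times it must
    occur; the largest count is the k of the resulting k-ORE.
    """
    # Start from one leaf per required symbol occurrence.
    forest = [("sym", a) for a, n in symbol_counts.items() for _ in range(n)]
    rng.shuffle(forest)
    ops, probs = zip(*weights.items())
    while len(forest) > 1:
        op = rng.choices(ops, weights=probs)[0]
        if op in ("concat", "or"):
            # A binary operator merges two randomly chosen subtrees.
            left = forest.pop(rng.randrange(len(forest)))
            right = forest.pop(rng.randrange(len(forest)))
            forest.append((op, left, right))
        else:
            # A unary operator wraps one randomly chosen subtree.
            i = rng.randrange(len(forest))
            forest[i] = (op, forest[i])
    return forest[0]

def leaves(node):
    """List the alphabet symbols of an expression tree, left to right."""
    if node[0] == "sym":
        return [node[1]]
    return [a for child in node[1:] for a in leaves(child)]

def show(node):
    """Render a tree as text; one-or-more is written ^+ to avoid clashing with +."""
    kind = node[0]
    if kind == "sym":
        return node[1]
    if kind == "concat":
        return f"({show(node[1])} {show(node[2])})"
    if kind == "or":
        return f"({show(node[1])} + {show(node[2])})"
    return f"({show(node[1])})" + {"?": "?", "*": "*", "+": "^+"}[kind]

rng = random.Random(0)
expr = generate_expression({"a": 2, "b": 1, "c": 1}, rng)  # a 2-ORE over {a, b, c}
print(show(expr))
```

Because every step that picks a binary operator shrinks the forest by one tree, the loop terminates with probability 1; the symbol counts are preserved by construction.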

To ensure the validity of our experiments, we want to generate a wide range of
different expressions. To this end, we measure how much the language of a generated
expression overlaps with Σ*. The larger the overlap, the greater its language size
as defined in Section 4.3.1.

2 This corpus was also used in [Bex et al. 2007] for XSD inference.

Fig. 7. A snapshot of the 100 generated expressions.

To ensure that the generated expressions do not impede readability by containing 
redundant subexpressions (as in e.g., (a+)+), the final step of our generator is to 
syntactically simplify the generated expressions using the following straightforward 
equivalences: 



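\begin{align*}
r^{**} &\to r^{*} \\
r?? &\to r? \\
(r^{+})^{+} &\to r^{+} \\
(r?)^{+} &\to r^{+}? \\
(r_1 \cdot r_2) \cdot r_3 &\to r_1 \cdot (r_2 \cdot r_3) \\
r_1 \cdot (r_2 \cdot r_3) &\to r_1 \cdot r_2 \cdot r_3 \\
(r_1? \cdot r_2?)? &\to r_1? \cdot r_2? \\
(r_1 + r_2) + r_3 &\to r_1 + (r_2 + r_3) \\
r_1 + (r_2 + r_3) &\to r_1 + r_2 + r_3 \\
(r_1^{+} + r_2^{+})^{+} &\to (r_1 + r_2)^{+} \\
(r_1^{+} + r_2)^{+} &\to (r_1 + r_2)^{+} \\
r_1 + r_2? &\to (r_1 + r_2)?
\end{align*}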

Of course, the resulting expression is rejected if it is non-deterministic. 
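To make the simplification step concrete, the sketch below applies three of the unary equivalences (r?? to r?, (r^+)^+ to r^+, and (r?)^+ to r^+?) bottom-up on an expression tree. The nested-tuple representation and the function name `simplify` are assumptions for this example; the remaining rules and the determinism check are left out.

```python
def simplify(node):
    """Apply three unary equivalences bottom-up (illustrative subset only).

    Expressions are nested tuples: ("sym", "a"), ("?", r), ("+", r), ("*", r),
    ("concat", r1, r2), ("or", r1, r2).
    """
    kind = node[0]
    if kind == "sym":
        return node
    # Simplify the operands first, then look at the current operator.
    children = tuple(simplify(child) for child in node[1:])
    node = (kind,) + children
    if kind == "?" and children[0][0] == "?":     # r?? -> r?
        return children[0]
    if kind == "+" and children[0][0] == "+":     # (r^+)^+ -> r^+
        return children[0]
    if kind == "+" and children[0][0] == "?":     # (r?)^+ -> r^+?
        inner = children[0][1]
        # Re-simplify, since the new r^+ may itself match (r^+)^+ -> r^+.
        return simplify(("?", ("+", inner)))
    return node

a = ("sym", "a")
print(simplify(("?", ("?", a))))          # ('?', ('sym', 'a'))
print(simplify(("+", ("?", ("+", a)))))   # ('?', ('+', ('sym', 'a')))
```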

To obtain a diverse target set, we synthesized expressions with alphabet size 5
(45 expressions), 10 (45 expressions), and 15 (10 expressions) with a variety of
symbol occurrences (k = 1, 2, 3). For each of the alphabet sizes, the expressions
were selected to cover language sizes ranging from 0 to 1. All in all, this yielded a
set of 100 deterministic target expressions. A snapshot is given in Figure 7.

Synthetic sample generation. For each of those 100 target expressions, we
generated synthetic samples by transforming the target expressions into stochastic
processes that perform random walks on the automata representing the expressions
(cf. Section 4). The probability distributions of these processes are derived from the
structure of the originating expression. In particular, each operand in a disjunction
is equally likely, and the zero-or-one operator ? yields zero or one occurrences with
probability 1/2 each. The probability of having n repetitions in a one-or-more or
zero-or-more operator (^+ and *) is determined by the probability that we choose to
continue looping (2/3) or to leave the loop (1/3). The latter values are based on
observations of real-world corpora. Figure 8 illustrates how we construct the desired
stochastic process from a regular expression r: starting from the initial graph in
which r is the only internal node, we keep applying the rewrite rules shown until each
internal node is an individual alphabet symbol.

Fig. 8. From a regular expression to a probabilistic automaton.
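The sampling probabilities described above can be sketched in Python as follows. For brevity this sketch samples directly from an expression tree instead of first building the probabilistic automaton, so it is not the construction of Figure 8. The nested-tuple representation, the name `sample_word`, and the treatment of the zero-or-more operator (it may stop before the first iteration) are assumptions.

```python
import random

CONTINUE_LOOP = 2 / 3   # probability of another iteration, as quoted in the text

def sample_word(node, rng):
    """Draw one word from the language of `node` (a nested-tuple expression)."""
    kind = node[0]
    if kind == "sym":                       # ("sym", "a")
        return [node[1]]
    if kind == "concat":                    # ("concat", r1, ..., rn)
        return [a for child in node[1:] for a in sample_word(child, rng)]
    if kind == "or":                        # every operand is equally likely
        return sample_word(rng.choice(node[1:]), rng)
    if kind == "?":                         # zero or one occurrence, 1/2 each
        return sample_word(node[1], rng) if rng.random() < 0.5 else []
    if kind in ("+", "*"):
        # Assumption: "+" forces a first iteration, "*" may stop immediately.
        word = sample_word(node[1], rng) if kind == "+" else []
        while rng.random() < CONTINUE_LOOP:
            word += sample_word(node[1], rng)
        return word
    raise ValueError(f"unknown node kind: {kind}")

# Example target expression: (a b? + c)^+ d
EXPR = ("concat",
        ("+", ("or", ("concat", ("sym", "a"), ("?", ("sym", "b"))), ("sym", "c"))),
        ("sym", "d"))

rng = random.Random(42)
sample = ["".join(sample_word(EXPR, rng)) for _ in range(200)]
print(sample[:5])
```

With a continuation probability of 2/3, a loop body is repeated three times on average for the one-or-more operator, so the generated words stay short while still exercising the loops.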

Experiments on covering samples. Our first experiment is designed to test
how iDRegEx performs on samples that are at least large enough to cover the
target regular expression, in the following sense.

Definition 5.1. A sample S covers a deterministic automaton G if for every edge
(s,t) in G there is a word w ∈ S whose unique accepting run in G traverses (s,t).
Such a word w is called a witness for (s,t). A sample S covers a deterministic
regular expression r if it covers the automaton obtained from r using the Glushkov
construction for translating regular expressions into automata as defined in
Definition 4.7.

Intuitively, if a sample does not cover a target regular expression r then there 
will be parts of r that cannot be learned from S. In this sense, covering samples 
are the minimal samples necessary to learn r. Note that such samples are far from 
"complete" or "characteristic" in the sense of the theoretical framework of learning 
in the limit, as some characteristic samples are bound to be of size exponential in 
the size of r by Theorem 3.2, while samples of size at most quadratic in r suffice 
to cover r. Indeed, the Glushkov construction always yields an automaton whose 
number of states is bounded by the size of r. Therefore, this automaton can have 
at most |r|^2 edges, and hence |r|^2 witness words suffice to cover r.

Table II shows how iDRegEx performs on covering samples, broken down by alphabet
size of the target expressions. The size of the sample used is shown as well.
The table demonstrates a remarkable precision. Out of a total of 100 expressions,
82 are derived exactly by iDRegEx. Although iDRegEx(RWR°) outperforms
iDRegEx with a success rate of 87 %, overall iDRegEx^fixed performs best with
89 %. The performance decreases with the alphabet size of the target expressions:
this is to be expected since the inference task's complexity increases. It should
be emphasized that even if iDRegEx^fixed does not derive the target expression
exactly, it always yields an over-approximation, i.e., its language is a superset of
the target language.

Table III shows an alternative view on the results. It shows the success rate as a 
function of the target expression's language size, grouped in intervals. In particular, 
it demonstrates that the method works well for all language sizes. 

A final perspective is offered in Table IV, which shows the success rate as a function
of the average number of states per symbol κ for an expression. The latter quantity is
defined as the length of the regular expression excluding operators, divided by the
alphabet size. For instance, for the expression a(a + b) + cab, κ = 6/3 since its length
excluding operators is 6 and |Σ| = 3. It is clear that the learning task is harder
for increasing values of κ. To verify the latter, a few extra expressions with large κ
values were added to the target expressions. For the algorithm iDRegEx^fixed the
success rate is quite high for target expressions with a large value of κ. Conversely,
iDRegEx(RWR°) yields better results for κ < 1.6, while its success rate drops to
around 50 % for larger values of κ. This illustrates that neither iDRegEx(RWR°)
nor iDRegEx^fixed outperforms the other in all situations.
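The quantity κ is simple to compute. The following Python sketch reproduces the worked example from the text; it assumes that every alphabet symbol is written as a single letter and that all other characters are operators or parentheses, and the function name `kappa` is ours.

```python
def kappa(expression: str) -> float:
    """Average number of states per symbol: symbol occurrences / alphabet size."""
    # Assumption: alphabet symbols are single letters; everything else
    # (parentheses, +, ?, *, whitespace) is treated as an operator.
    occurrences = [ch for ch in expression if ch.isalpha()]
    return len(occurrences) / len(set(occurrences))

print(kappa("a(a + b) + cab"))   # 6 occurrences over 3 symbols gives 2.0
```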





|Σ|      #regex   iDRegEx(RWR°)   iDRegEx   iDRegEx^fixed   |S|
5        45       86 %            97 %      100 %           300
10       45       93 %            75 %      84 %            1000
15       10       70 %            50 %      60 %            1500
total    100      87 %            82 %      89 %

Table II. Success rate on the target regular expressions and the sample size used per alphabet size
for the various algorithms.



Density(r)    #regex   iDRegEx(RWR°)   iDRegEx   iDRegEx^fixed
[0.0, 0.2[    24       100 %           87 %      96 %
[0.2, 0.4[    22       82 %            91 %      91 %
[0.4, 0.6[    20       90 %            75 %      85 %
[0.6, 0.8[    22       95 %            72 %      83 %
[0.8, 1.0]    12       83 %            78 %      78 %

Table III. Success rate on the target regular expressions, grouped by language size.

κ             #regex   iDRegEx(RWR°)   iDRegEx   iDRegEx^fixed
[1.2, 1.4[    29       96 %            72 %      83 %
[1.4, 1.6[    37       100 %           89 %      89 %
[1.6, 1.8[    24       91 %            92 %      100 %
[1.8, 2.0[    11       54 %            91 %      100 %
[2.0, 2.5[    12       41 %            50 %      50 %
[2.5, 3.0]    18       66 %            71 %      78 %

Table IV. Success rate on the target regular expressions, grouped by κ, the average number of
states per symbol.

It is also interesting to note that iDRegEx successfully derived the regular
expression r_1 = (a_1 a_2 + a_3 + ... + a_n)^+ of Theorem 3.2 for n = 8, n = 10, and n = 12
from covering samples of size 500, 800, and 1100, respectively. This is quite surprising
considering that the characteristic samples for these expressions were proven to
be of size at least (n - 2)!, i.e., 720, 40320, and 3628800, respectively. The regular
expression r_2 = (Σ \ a_1)^+ a_1 (Σ \ a_1)^+, in contrast, was not derivable by iDRegEx
from small samples.
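The gap between the two sample sizes can be checked with a few lines of Python. The values of n and the covering-sample sizes are taken from the text; the names `REPORTED` and `characteristic_lower_bound` are ours.

```python
from math import factorial

# n and the covering-sample sizes reported above for r_1.
REPORTED = {8: 500, 10: 800, 12: 1100}

def characteristic_lower_bound(n: int) -> int:
    """Lower bound (n - 2)! on the size of a characteristic sample for r_1."""
    return factorial(n - 2)

for n, covering_size in REPORTED.items():
    # Prints n, the covering-sample size used, and the (n - 2)! bound.
    print(n, covering_size, characteristic_lower_bound(n))
```

For n = 8 the covering sample is comparable in size to the bound, while for n = 12 it is more than three orders of magnitude smaller.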

Experiments on partially covering samples. Unfortunately, the samples from which
regular expressions must be learned are often smaller than one would prefer. In an
extreme, but not uncommon case, the sample does not even entirely cover the target
expression. In this section we therefore test how iDRegEx performs on such samples.

Definition 5.2. The coverage of a target regular expression r by a sample S is
defined as the fraction of transitions in the corresponding Glushkov automaton for
r that have at least one witness in S.
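The notions of witness and coverage can be illustrated with the Python sketch below. It builds a textbook Glushkov (position) automaton with a single initial state 0 and one state per symbol occurrence, which may differ in detail from Definition 4.7; the nested-tuple representation with binary concatenation and disjunction, and all function names, are assumptions made for this example.

```python
def glushkov(node):
    """Return (symbols, first, last, follow, nullable) of the position automaton."""
    symbols, follow = {}, {}

    def visit(n):
        kind = n[0]
        if kind == "sym":                       # new position for this occurrence
            p = len(symbols) + 1
            symbols[p], follow[p] = n[1], set()
            return {p}, {p}, False
        if kind in ("?", "*", "+"):
            first, last, nullable = visit(n[1])
            if kind in ("*", "+"):              # loop back from last to first
                for p in last:
                    follow[p] |= first
            return first, last, nullable or kind != "+"
        f1, l1, n1 = visit(n[1])
        f2, l2, n2 = visit(n[2])
        if kind == "or":
            return f1 | f2, l1 | l2, n1 or n2
        for p in l1:                            # concatenation
            follow[p] |= f2
        return (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2), n1 and n2

    first, last, nullable = visit(node)
    return symbols, first, last, follow, nullable

def run_edges(word, symbols, first, last, follow, nullable):
    """Edges traversed by the accepting run of `word`, or None if it is rejected."""
    state, used = 0, set()
    for a in word:
        candidates = first if state == 0 else follow[state]
        nxt = [q for q in candidates if symbols[q] == a]
        if len(nxt) != 1:                       # rejected, or not deterministic
            return None
        used.add((state, nxt[0]))
        state = nxt[0]
    accepted = nullable if state == 0 else state in last
    return used if accepted else None

def coverage(expr, sample):
    """Fraction of automaton edges that have at least one witness in `sample`."""
    symbols, first, last, follow, nullable = glushkov(expr)
    all_edges = {(0, p) for p in first} | {(p, q) for p in follow for q in follow[p]}
    witnessed = set()
    for word in sample:
        used = run_edges(word, symbols, first, last, follow, nullable)
        if used:
            witnessed |= used
    return len(witnessed) / len(all_edges)

# Example: r = (a + b)^+ c has 8 edges in its position automaton.
R = ("concat", ("+", ("or", ("sym", "a"), ("sym", "b"))), ("sym", "c"))
print(coverage(R, ["ac"]))                        # 0.25
print(coverage(R, ["abc", "bac", "aac", "bbc"]))  # 1.0
```

A sample covers r in the sense of Definition 5.1 exactly when this function returns 1.0; words that are not accepted contribute no witnesses.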

Note that to successfully learn r from a partially covering sample, iDRegEx
needs to "guess" the edges for which there is no witness in S. This guessing
capability is built into iDRegEx(RWR°) and iDRegEx in the form of repair rules [Bex
et al. 2006; Bex et al. 2008]. Our experiments show that for target expressions
with alphabet size |Σ| = 10, this is highly effective for iDRegEx(RWR°): even at a
coverage of 70 %, half the target expressions can still be learned correctly, as Table V
shows. The algorithm iDRegEx performs very poorly in this setting, being
successful only occasionally for coverages close to 100 %. iDRegEx^fixed performs
better, although not as well as iDRegEx(RWR°). This again illustrates that both
algorithms have their merits.



coverage   iDRegEx(RWR°)   iDRegEx   iDRegEx^fixed
1.0        100 %           80 %      80 %
0.9        64 %            20 %      60 %
0.8        60 %            0 %       40 %
0.7        52 %            0 %       0 %
0.6        0 %             0 %       0 %

Table V. Success rate for 25 target expressions for |Σ| = 10 for samples that provide partial
coverage of the target expressions.

We also experimented with target expressions with alphabet size |Σ| = 5. In this
case, the results were not very promising for iDRegEx(RWR°), but as Table VI
illustrates, iDRegEx and iDRegEx^fixed perform better, on par with the target
expressions for |Σ| = 10 in the case of iDRegEx^fixed. This is interesting since
the absolute amount of information missing for smaller regular expressions is larger
than in the case of larger expressions.
coverage   iDRegEx(RWR°)   iDRegEx   iDRegEx^fixed
1.0        100 %           100 %     100 %
0.9        25 %            75 %      66 %
0.8        16 %            75 %      41 %
0.7        8 %             25 %      33 %
0.6        8 %             25 %      17 %
0.5        0 %             8 %       17 %

Table VI. Success rate for 12 target expressions for |Σ| = 5 with partially covering samples.

6. CONCLUSIONS 

We presented the algorithm iDRegEx for inferring a deterministic regular expression
from a sample of words. Motivated by regular expressions occurring in practice,
we use a novel measure based on the number k of occurrences of the same alphabet
symbol and derive expressions for increasing values of k. We demonstrated the
remarkable effectiveness of iDRegEx on a large corpus of real-world and synthetic
regular expressions of different densities.

Our experiments show that iDRegEx(RWR°) performs better than iDRegEx
for target expressions with κ < 1.6 and vice versa for larger values of κ. For
partially covering samples, iDRegEx(RWR°) is more robust than iDRegEx. As κ
values and sample coverage are not known in advance, it makes sense to run both
algorithms and select the smallest expression or the one with the smallest language
size, depending on the application at hand.

Some questions need further attention. First, in our experiments, iDRegEx
always derived the correct expression or a super-approximation of the target
expression. It remains to investigate for which kinds of input samples this behavior
can be formally proved. Second, it would also be interesting to characterize
precisely which classes of expressions can be learned with our method. Although the
parameter κ explains this to some extent, we probably need more fine-grained
measures. A last and obvious goal for future work is to speed up the inference of
the probabilistic automaton, which forms the bottleneck of the proposed algorithm.
A possibility is to use an industrial-strength implementation of the Baum-Welch
algorithm as in [Finn et al. 2006] rather than a straightforward one, or to explore
different methods for learning probabilistic automata.

Although iDRegEx can be directly plugged into the XSD inference engine iXSD
of [Bex et al. 2007], it would be interesting to investigate how to extend these 
techniques to the more robust class of Relax NG schemas [Clark and Murata 2001]. 